Abstract


A multithreaded program with a bug may behave nondeterministically, and this nondeter- 
minism typically makes the bug hard to localize. This thesis presents a debugging tool, the 
Nondeterminator-2, which automatically finds certain nondeterminacy bugs in programs 
coded in the Cilk multithreaded language. Specifically, the Nondeterminator-2 finds “dag 
races,” which occur when two logically parallel threads access the same memory location 
while holding no locks in common, and at least one of the accesses writes the location. 

The Nondeterminator-2 contains two dynamic algorithms, ALL-SETS and BRELLY, 
which check for dag races in the computation generated by the serial execution of a Cilk 
program on a given input. For a program that runs serially in time $T$, accesses $V$ shared
memory locations, uses a total of $n$ locks, and holds at most $k < n$ locks simultaneously,
ALL-SETS runs in $O(n^k T \alpha(V,V))$ time and $O(n^k V)$ space, where $\alpha$ is Tarjan’s functional
inverse of Ackermann’s function. The faster BRELLY algorithm runs in $O(kT \alpha(V,V))$ time
using $O(kV)$ space and can be used to detect races in programs intended to obey the
“umbrella” locking discipline, a programming methodology that precludes races. 

In order to explain the guarantees provided by the Nondeterminator-2, we provide a 
framework for defining nondeterminism and define several “levels” of nondeterministic pro- 
gram behavior. Although precise detection of nondeterminism is in general computationally 
infeasible, we show that an “abelian” Cilk program, one whose critical sections commute, 
produces a determinate final state if it is deadlock free and if it can generate a dag-race free 
computation. Thus, the Nondeterminator-2’s two algorithms can verify the determinacy of 
a deadlock-free abelian program running on a given input. 

Finally, we describe our experiences using the Nondeterminator-2 on a real-world ra- 
diosity program, which is a graphics application for modeling light in diffuse environments. 
With the help of the Nondeterminator-2, we were able to speed up the entire radiosity 
application 5.97 times on 8 processors while changing less than 5 percent of the code. The 
Nondeterminator-2 allowed us to certify that the application had no race bugs with a high 
degree of confidence. 
Chapter 1


Introduction 


When parallel programs have bugs, they can be nondeterministic, meaning that dif- 
ferent executions produce different behaviors. In this thesis, we present a debugging 
tool, the Nondeterminator-2, which automatically finds nondeterminacy bugs in par- 
allel programs. We give a theoretical model of nondeterminism that precisely explains 
the guarantees provided by the Nondeterminator-2. We further demonstrate the
effectiveness of this debugging tool by showing how it was used to parallelize a complex,
real-world application.


Nondeterminism 


Because of the vagaries of timings of multiple processors, parallel programs can be 
nondeterministic. Nondeterminism poses a serious challenge for debugging, because 
reproducing the situation that caused a particular bug can be difficult. Also, verifying 
that a program works correctly in one scheduling does not preclude the possibility of 
bugs in future executions. 

In this thesis, we develop techniques for debugging parallel programs coded in 
the Cilk language. The Cilk [3, 4, 7, 16, 23] project is designed to make it easy for 
programmers to write efficient parallel programs. Parallel computing has long been 
an area of research, but it has yet to reach the “mainstream” world of professional
programmers, even though parallel machines are becoming more available. Traditional
techniques for parallelization typically require programmers to have intimate
knowledge of the workings of their parallel architectures. Cilk alleviates this problem
by allowing programmers to code in the Cilk language, which is a simple extension to
the programming language C [24]. The Cilk runtime system then automatically and
efficiently runs this code on multiprocessor machines.


    int x;

    cilk int main()
    {
      x = 2;
      spawn foo1();
      spawn foo2();
      sync;
      printf("%d", x);
      return 0;
    }

    cilk void foo1()
    {
      x += 2;
    }

    cilk void foo2()
    {
      x *= 3;
    }


Figure 1-1: A nondeterministic Cilk program. The spawn statement in a Cilk program
creates a parallel subprocedure, and the sync statement provides control synchronization
to ensure that all spawned subprocedures have completed.


    Time   Processor 1            Processor 2
    1      tmp1 ← x
    2                             tmp2 ← x
    3      tmp1 ← tmp1 + 2
    4                             tmp2 ← tmp2 * 3
    5      x ← tmp1
    6                             x ← tmp2


Figure 1-2: An example of the machine instructions comprising updates to a shared
variable x being interleaved. The final value of x in this particular execution is 6.

Cilk programs can still have nondeterminacy bugs, however. Figure 1-1 shows a
Cilk program that behaves nondeterministically. The procedures foo1 and foo2 run
in parallel, resulting in parallel access to the shared variable x. The value of x printed
by main is 12 if foo1 happens to run before foo2, but it is 8 if foo2 happens to run
before foo1. Additionally, main might also print 4 or 6 for x, because the statements
in foo1 and foo2 are composed of multiple machine instructions that may interleave,
possibly resulting in a lost update to x.
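
The four possible outputs can be checked by brute force. The following Python sketch is an illustration only and is not part of the Nondeterminator-2. It assumes that each update decomposes into exactly three machine-level steps (load, compute, store), as in Figure 1-2, and the names `run` and `all_final_values` are hypothetical.

```python
from itertools import combinations

def run(schedule):
    """Execute one interleaving; schedule is a sequence of processor ids (1 or 2)."""
    x = 2                                   # main sets x = 2 before spawning
    tmp = {1: None, 2: None}                # per-processor temporaries
    pc = {1: 0, 2: 0}                       # next step of each processor
    op = {1: lambda v: v + 2,               # foo1: x += 2
          2: lambda v: v * 3}               # foo2: x *= 3
    for p in schedule:
        if pc[p] == 0:
            tmp[p] = x                      # load x into a temporary
        elif pc[p] == 1:
            tmp[p] = op[p](tmp[p])          # compute on the temporary
        else:
            x = tmp[p]                      # store the temporary back to x
        pc[p] += 1
    return x

def all_final_values():
    """Enumerate every interleaving of the two three-step sequences."""
    results = set()
    for slots in combinations(range(6), 3):     # positions taken by processor 1
        schedule = [1 if i in slots else 2 for i in range(6)]
        results.add(run(schedule))
    return results

print(sorted(all_final_values()))           # [4, 6, 8, 12]
print(run([1, 2, 1, 2, 1, 2]))              # the schedule of Figure 1-2: 6
```

The values 8 and 12 arise from the schedules in which one update completes before the other begins; 4 and 6 arise when both processors load the initial value 2, so that one update is lost.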
    int x;
    Cilk_lockvar A;

    cilk int main()
    {
      x = 2;
      Cilk_lock_init(A);
      spawn foo1();
      spawn foo2();
      sync;
      printf("%d", x);
      return 0;
    }

    cilk void foo1()
    {
      Cilk_lock(A);
      x += 2;
      Cilk_unlock(A);
    }

    cilk void foo2()
    {
      Cilk_lock(A);
      x *= 3;
      Cilk_unlock(A);
    }


Figure 1-3: A Cilk program that incorporates user-level locking to produce atomic crit- 
ical sections. Locks are declared as Cilk_lockvar variables, and must be initialized by 
Cilk_lock_init() statements. The function Cilk_lock() acquires a specified lock, and 
Cilk_unlock() releases a lock. 


Figure 1-2 shows an example of this interleaving occurring. Processor 1 performs 
the x += 2 operation at the “same time” as processor 2 performs the x *= 3
operation. The individual machine instructions that comprise these operations interleave,
producing the value 6 as the final value of x. 

This behavior is likely to be a bug, but it may be the programmer’s intention. It is 
also possible that the programmer intended 8 or 12 to be legal final values for x, but 
not 4 or 6. This behavior could be legitimately achieved through the use of mutual- 
exclusion locks. A lock is a language construct, typically implemented as a location 
in shared memory, that can be acquired and released but that is guaranteed to be 
acquired by at most one thread at once. In other words, locks allow the programmer 
to force certain sections of the code, called critical sections, to be “atomic” with 
respect to each other. Two operations are atomic if the instructions that comprise 
them cannot be interleaved. Figure 1-3 shows the program in Figure 1-1 with locks 
added. In this version, the value of x printed by main may be either 8 or 12, but 
cannot be 4 or 6. 
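
The effect of the lock can be observed with ordinary threads. The sketch below is an analogy rather than Cilk code: it assumes that Python's `threading.Lock` stands in for a `Cilk_lockvar`, that starting a thread plays the role of spawn, and that joining plays the role of sync. The name `run_once` is hypothetical.

```python
import threading

def run_once():
    state = {"x": 2}
    lock_a = threading.Lock()        # stands in for Cilk_lockvar A

    def foo1():
        with lock_a:                 # Cilk_lock(A) ... Cilk_unlock(A)
            state["x"] += 2

    def foo2():
        with lock_a:
            state["x"] *= 3

    threads = [threading.Thread(target=foo1), threading.Thread(target=foo2)]
    for t in threads:
        t.start()                    # analogous to spawn
    for t in threads:
        t.join()                     # analogous to sync
    return state["x"]

# Which of the two legal values appear depends on scheduling,
# but the lost-update values 4 and 6 cannot occur.
results = {run_once() for _ in range(200)}
print(results)
```

Because both critical sections use the same lock, each update runs atomically with respect to the other, and only the order of the two updates remains nondeterministic.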

The program in Figure 1-3 is nondeterministic, but it is somehow “less
nondeterministic” than the program in Figure 1-1. Indeed, while Figure 1-3 uses locks
to “control” nondeterminism, the locks themselves are inherently nondeterministic,
because the semantics of locks is that any of the threads trying to acquire a lock
may in fact be the one to get it. In fact, it is arguable that any Cilk program is
nondeterministic, because memory updates happen in different orders depending on
scheduling.


    int x;
    Cilk_lockvar A;
    Cilk_lockvar B;

    cilk int main()
    {
      x = 2;
      Cilk_lock_init(A);
      Cilk_lock_init(B);
      spawn foo1();
      spawn foo2();
      sync;
      printf("%d", x);
      return 0;
    }

    cilk void foo1()
    {
      Cilk_lock(A);
      x += 2;
      Cilk_unlock(A);
    }

    cilk void foo2()
    {
      Cilk_lock(B);
      x *= 3;
      Cilk_unlock(B);
    }


Figure 1-4: A Cilk program that uses locks but that still contains a data race. The
distinct locks A and B do not prevent the updates to x from interleaving.


Rather than attempt to discuss these issues with such ambiguity, we present a 
formal model for defining nondeterminism. Under this model, we can precisely define 
multiple kinds of nondeterminism. In particular, we define the concept of a data 
race: intuitively, a situation where parallel threads could update (or update and 
access) a memory location “simultaneously.” Figure 1-1 contains a data race, whereas 
Figure 1-3 does not. It should be noted, however, that the mere presence of locks does 
not preclude data races. It is still necessary to use the right locks in the right places.
Figure 1-4 shows an example where locks have been used (presumably) incorrectly.
The two distinct locks A and B do not have any effect on each other, so the updates
to x once again may interleave in a data race.


Data races may not exactly represent the form of nondeterminism that program- 
mers care about. Data races are likely to be bugs, however, and they are interesting 
because they are a form of nondeterminism that we can hope to detect automatically.
By automatically detecting data races, we can provide debugging information to the
programmer that is of great use when trying to track down nondeterminacy bugs.


[Figure: computation dag. The node x=2 is followed by Cilk_lock_init(A) and Cilk_lock_init(B); two parallel branches, Cilk_lock(A), x+=2, Cilk_unlock(A) and Cilk_lock(B), x*=3, Cilk_unlock(B), then join at printf("%d", x).]


Figure 1-5: The computation dag for the program in Figure 1-4. A dag race exists
between the two highlighted nodes.

Unfortunately, even detection of data races is computationally too difficult to be 
done in a practical debugging tool. Instead, the Nondeterminator-2 detects “dag 
races.” Roughly, a dag race is like a data race, but the question of whether
two memory accesses could occur simultaneously is approximated. We say that an
execution of a Cilk program generates a computation, which is a directed acyclic
graph (dag) where the nodes represent the instructions of the program and the edges
represent the parallel control constructs. The dag for Figure 1-4 is shown in
Figure 1-5.¹ The dag is an approximation of other possible executions of the same program on
the same input; that is, we consider the possible executions of the program on that 
input to be the topological sorts of the dag in which each lock is held at most once at 
any given time. A dag race, then, occurs when two instructions that are unrelated 
in the dag both access the same memory location, at least one of the accesses is a 
write, and no common lock is held across both of the accesses. Figure 1-5 has a dag 
race between the two highlighted nodes. 
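
The definition of a dag race can be made concrete with a brute-force check. The sketch below is a quadratic illustration of the definition only; it is not the ALL-SETS or BRELLY algorithm. The encoding of the dag (an adjacency dictionary plus one access record per node) and the names `reachable` and `dag_races` are assumptions made for this example.

```python
from itertools import combinations

def reachable(edges, src, dst):
    """True if the dag has a path from src to dst."""
    stack, seen = [src], set()
    while stack:
        u = stack.pop()
        if u == dst:
            return True
        if u not in seen:
            seen.add(u)
            stack.extend(edges.get(u, ()))
    return False

def dag_races(accesses, edges):
    """accesses maps node -> (location, is_write, set of locks held)."""
    races = []
    for u, v in combinations(sorted(accesses), 2):
        loc_u, write_u, locks_u = accesses[u]
        loc_v, write_v, locks_v = accesses[v]
        # Two nodes are unrelated if neither can reach the other.
        unrelated = not reachable(edges, u, v) and not reachable(edges, v, u)
        if (unrelated and loc_u == loc_v and (write_u or write_v)
                and not (locks_u & locks_v)):
            races.append((u, v))
    return races

# Simplified dags for Figures 1-4 and 1-3: init -> {add, mul} -> print.
edges = {"init": ["add", "mul"], "add": ["print"], "mul": ["print"]}
fig_1_4 = {"init": ("x", True, set()), "add": ("x", True, {"A"}),
           "mul": ("x", True, {"B"}), "print": ("x", False, set())}
fig_1_3 = dict(fig_1_4, mul=("x", True, {"A"}))

print(dag_races(fig_1_4, edges))   # [('add', 'mul')]
print(dag_races(fig_1_3, edges))   # []
```

With distinct locks A and B the two updates share no lock and are reported; with the single lock A of Figure 1-3 no pair qualifies.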


¹This picture of the dag is a simplification; a formal method for construction of the dag is given
in Chapter 7.

As we shall see, the dag races that the Nondeterminator-2 detects are not the
same thing as data races. Nonetheless, experience shows that dag races are a good
enough approximation to be useful to report as debugging information to the program- 
mer. Furthermore, we show that for some set of programs, the dag races precisely 
correspond to data races. One such set of programs is the “abelian” programs in 
which all critical sections protected by the same lock “commute”: intuitively, the 
critical sections produce the same effect regardless of scheduling. We show that if 
a (deadlock-free) abelian program generates a computation with no dag races, then 
the program is determinate: all schedulings produce the same final result. The 
consequence, therefore, is that for an abelian program, the Nondeterminator-2 can 
verify the determinacy of the program on a given input. 

The Nondeterminator-2 cannot provide such a guarantee for nonabelian programs. 
Even for such programs, however, we expect that reporting dag races to the user 
provides a useful debugging heuristic. Indeed, this approach has been implicitly
taken by all previous dynamic race-detection tools.


Race-detection algorithms 


In previous work, some efforts have been made to detect data races statically (at 
compile-time) [31, 42]. Static debuggers have the advantage that they sometimes can 
draw conclusions about the program for all inputs. Since understanding any nontrivial 
semantics of the program is generally undecidable, however, most race detectors are 
dynamic tools in which potential races are detected at runtime by executing the pro- 
gram on a given input. Some dynamic race detectors perform a post-mortem analysis 
based on program execution traces [12, 21, 29, 32], while others perform an “on- 
the-fly” analysis during program execution. On-the-fly debuggers directly instrument 
memory accesses via the compiler [10, 11, 14, 15, 28, 36], by binary rewriting [39], or 
by augmenting the machine’s cache coherence protocol [30, 37]. 

In this thesis, we present two race detection algorithms which are based on the 
Nondeterminator [14], a tool that finds dag races in Cilk programs that do not use 
locks. The Nondeterminator executes a Cilk program serially on a given input,
maintaining an efficient “SP-bags” data structure to keep track of the logical series/parallel
relationships between threads. For a Cilk program that runs serially in time $T$ and
accesses $V$ shared-memory locations, the Nondeterminator runs in $O(T \alpha(V,V))$ time
and $O(V)$ space, where $\alpha$ is Tarjan’s functional inverse of Ackermann’s function,
which for all practical purposes is at most 4.

The Nondeterminator-2, the tool presented here, finds dag races in Cilk programs 
that use locks. This race detector contains two algorithms, both of which use the same 
efficient SP-bags data structure from the original Nondeterminator. The first of these 
algorithms, ALL-SETS, is an on-the-fly algorithm that, like most other race-detection 
algorithms, assumes that no locks are held across parallel control statements, such 
as spawn and sync. The second algorithm, BRELLY, is a faster on-the-fly algorithm, 
but in addition to reporting dag races as bugs, it also reports as bugs some complex 
locking protocols that are probably undesirable but that may be race free. 

The ALL-SETS algorithm executes a Cilk program serially on a given input and 
either detects a dag race in the computation or guarantees that none exist. For a 
Cilk program that runs serially in time $T$, accesses $V$ shared-memory locations, uses
a total of $n$ locks, and holds at most $k < n$ locks simultaneously, ALL-SETS runs
in $O(n^k T \alpha(V,V))$ time and $O(n^k V)$ space. Tighter, more complicated bounds on
ALL-SETS are given in Chapter 2. 

In previous work, Dinning and Schonberg’s “Lock Covers” algorithm [11] also 
detects all dag races in a computation. The ALL-SETS algorithm improves the Lock 
Covers algorithm by generalizing the data structures and techniques from the original 
Nondeterminator to produce better time and space bounds. Perkovic and Keleher [37] 
offer an on-the-fly race-detection algorithm that “piggybacks” on a cache-coherence 
protocol for lazy release consistency. Their approach is fast (about twice the serial 
work, and the tool runs in parallel), but it only catches races that actually occur 
during a parallel execution, not those that are logically present in the computation. 

Although the asymptotic performance bounds of ALL-SETS are the best to date,
they are a factor of $n^k$ larger in the worst case than those for the original
Nondeterminator. The BRELLY algorithm is asymptotically faster than ALL-SETS, and its
performance bounds are only a factor of $k$ larger than those for the original
Nondeterminator. For a Cilk program that runs serially in time $T$, accesses $V$ shared-memory
locations, and holds at most $k$ locks simultaneously, the serial BRELLY algorithm
runs in $O(kT \alpha(V,V))$ time and $O(kV)$ space. Since most programs do not hold
many locks simultaneously, this algorithm runs in nearly linear time and space. The
improved performance bounds come at a cost, however. Rather than detecting dag
races directly, BRELLY only detects violations of a “locking discipline” that precludes
dag races.


A locking discipline is a programming methodology that dictates a restriction 
on the use of locks. For example, many programs adopt the discipline of acquiring 
locks in a fixed order so as to avoid deadlock [22]. Similarly, the “umbrella” locking 
discipline precludes dag races by requiring that each location be protected by the 
same lock within every parallel subcomputation of the computation. Threads that 
are in series may use different locks for the same location (or possibly even none, if 
no parallel accesses occur), but if two threads in series are both in parallel with a 
third and all access the same location, then all three threads must agree on a single 
lock for that location. If a program obeys the umbrella discipline, a dag race cannot 
occur, because parallel accesses are always protected by the same lock. The BRELLY
algorithm detects violations of the umbrella locking discipline.


Savage et al. [39] originally suggested that efficient debugging tools can be devel- 
oped by requiring programs to obey a locking discipline. Their Eraser tool enforces a 
simple discipline in which any shared variable is protected by a single lock throughout 
the course of the program execution. Whenever a thread accesses a shared variable, it 
must acquire the designated lock. This discipline precludes dag races from occurring, 
and Eraser finds violations of the discipline in O(kT) time and O(kV) space. (These 
bounds are for the serial work; Eraser actually runs in parallel.) Eraser only works 
in a parallel environment containing several linear threads, however, with no nested 
parallelism or thread joining as is permitted in Cilk. In addition, since Eraser does 
not recognize the series/parallel relationship of threads, it does not properly
understand at what times a variable is actually shared. Specifically, it heuristically guesses
when the “initialization phase” of a variable ends and the “sharing phase” begins,
and thus it may miss some dag races.
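
A check of the single-lock discipline described above can be sketched by keeping, for each variable, the set of locks that have been held at every access so far, and flagging the variable once that set becomes empty. This is a simplified illustration of the idea, not Eraser's actual implementation: it omits the initialization-phase heuristic mentioned above, it reports every access after the set becomes empty rather than only the first, and the trace format and the name `eraser_check` are assumptions.

```python
def eraser_check(trace, all_locks):
    """trace: iterable of (thread, variable, locks_held) access events."""
    candidates = {}                       # variable -> locks held at every access so far
    violations = []
    for thread, var, held in trace:
        current = candidates.get(var, set(all_locks))
        current = current & set(held)     # refine the candidate set
        candidates[var] = current
        if not current:                   # no single lock protects var
            violations.append((thread, var))
    return violations

bad_trace = [("t1", "x", {"A"}), ("t2", "x", {"A", "B"}), ("t3", "x", {"B"})]
good_trace = [("t1", "x", {"A"}), ("t2", "x", {"A", "B"}), ("t3", "x", {"A"})]

print(eraser_check(bad_trace, {"A", "B"}))    # [('t3', 'x')]
print(eraser_check(good_trace, {"A", "B"}))   # []
```

Each access costs one set intersection over at most $k$ held locks, which is consistent with the $O(kT)$ time and $O(kV)$ space quoted above.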

In comparison, our BRELLY algorithm performs nearly as efficiently, is guaran- 
teed to find all violations, and importantly, supports a more flexible discipline. In 
particular, the umbrella discipline allows separate program modules to be composed 
in series without agreement on a global lock for each location. For example, an appli- 
cation may have three phases—an initialization phase, a work phase, and a clean-up 
phase—which can be developed independently without agreeing globally on the locks 
used to protect locations. If a fourth module runs in parallel with all of these phases 
and accesses the same memory locations, however, the umbrella discipline does re- 
quire that all phases agree on the lock for each shared location. Thus, although the 
umbrella discipline is more flexible than Eraser’s discipline, it is more restrictive than 
what a general dag-race detection algorithm, such as ALL-SETS, permits. 

Figure 1-6 compares the asymptotic performance of ALL-SETS and BRELLY with 
other race detection algorithms in the literature. A more in-depth discussion of this
comparison is given in Chapter 4.


Using the Nondeterminator-2 


In addition to presenting the ALL-SETS and BRELLY algorithms themselves, we dis- 
cuss practical issues surrounding their use. Specifically, we explain how they can be 
used when memory is allocated and freed dynamically. We describe techniques for 
annotating code in order to make dag race reports more useful for practical debugging 
purposes. Additionally, we present timings of the algorithms on a few example Cilk 
programs. 

Finally, we present an in-depth case study of our experiences parallelizing a large 
radiosity application. Radiosity is a graphics algorithm for modeling light in diffuse 
environments. Figure 1-7 shows a scene in which radiosity was used to model the 
reflections of light off of the walls of a maze. The majority of the calculation time 
for radiosity is spent calculating certain properties of the scene geometry. These 
calculations can be parallelized, and Cilk is ideally suited for this parallelization,
because its load-balancing scheduler is provably good, and so can obtain speedup
even for such irregular calculations.


b = total number of threads in the computation

k = maximum number of locks held simultaneously


Figure 1-6: Comparison of race detection algorithms. Tighter, more complicated bounds
are given for ALL-SETS (and Lock Covers) in Figure 4-1.

Parallelization speedup, however, is not particularly impressive if the same result 
could be achieved by optimizing the serial execution. Therefore, it is usually not 
desirable to rewrite applications for parallel execution, because the optimizations in 
the serial code might be lost. So instead of implementing our own radiosity code, 
we downloaded a large radiosity application developed at the Computer Graphics 
Research Group of the Katholieke Universiteit Leuven, in Belgium [2]. Since the 
code is written in C, and Cilk is a simple extension of C, running the code as a Cilk
program is effortless.


Figure 1-7: A maze scene rendered after 100 iterations of the radiosity algorithm.


The difficulty, however, is that the program is large, consisting of 75 source files 
and 23,000 lines of code. The code was not written with parallelization in mind, so 
there are portions where memory is shared “unnecessarily.” That is, operations that 
in principle could be independent actually write to the same memory locations. These
conflicts need to be resolved if those operations are to be run in parallel. Searching 
through the code for such problems would be very tedious. The Nondeterminator-2, 
however, provides a much faster approach. We just run in parallel those operations 
that are “in principle” independent, and use the Nondeterminator-2 to find the places 
in the code where this parallelization failed. In this way, we are directly pointed to 
the problem areas of the code and have no need to examine pieces of the code that 
don’t demonstrate any races. Our resulting Cilk radiosity code achieves a 5.97 times
speedup on 8 processors.


Organization of this thesis 


This thesis is organized into three major parts. 

Part I discusses the race-detection algorithms. Chapter 2 presents the ALL-SETS 
algorithm for detecting dag races in a Cilk computation, and Chapter 3 presents 
the BRELLY algorithm for detecting umbrella discipline violations. Chapter 4 then
gives a comparison of the asymptotic performance of these algorithms with other
race-detection algorithms in the literature.

Part II presents our theory of nondeterminism. Chapter 5 presents a framework 
for defining nondeterminism, and data races in particular. Chapter 6 shows that 
precise detection of data races is computationally infeasible. Chapter 7 explains why 
the dag races that are detected by the Nondeterminator-2 are not the same thing as 
data races. Chapter 8, however, defines the notion of abelian programs, and shows 
that there is a provable correspondence between dag races and data races for abelian 
programs. Furthermore, that chapter shows that the Nondeterminator-2 can verify 
the determinacy of deadlock-free abelian programs. A complicated proof of one lemma 
needed for this result is left to Appendix A. 

Finally, in Part III, we discuss some practical considerations surrounding the use
of the Nondeterminator-2. Chapter 9 discusses how to detect races in the pres- 
ence of dynamic memory allocation and how to reduce the number of “false race 
reports” that the Nondeterminator-2 produces. Timings of our implementation of 
the Nondeterminator-2 are also given in that chapter. Some of the ideas described 
in Chapter 9 were inspired by our experiences parallelizing the radiosity application; 
these experiences are described in Chapter 10. Chapter 11 offers some concluding 
remarks. 

Some of the results in this thesis appear in [6] and represent joint work with
Guang-Ien Cheng, Mingdong Feng, Charles E. Leiserson, and Keith Randall.
Part I


Race-Detection Algorithms 


Chapter 2 


The All-Sets Algorithm 


In this chapter, we present the ALL-SETS algorithm, which detects dag races in Cilk
computations.¹ We first give some background on Cilk and explain the series-parallel
structure of its computations. Next we review the SP-BAGS algorithm [14] used by the
original Nondeterminator. We then present the ALL-SETS algorithm itself, show
that it is correct, and analyze its performance. Specifically, we show that for a program that
runs serially in time $T$, accesses $V$ shared memory locations, uses a total of $n$ locks,
and holds at most $k < n$ locks simultaneously, ALL-SETS runs in $O(n^k T \alpha(V,V))$ time
and $O(n^k V)$ space, where $\alpha$ is Tarjan’s functional inverse of Ackermann’s function.
Furthermore, ALL-SETS guarantees to find a dag race in the generated computation
if and only if such a race exists.


Cilk 


Cilk is an algorithmic multithreaded language. The idea behind Cilk is to allow 
programmers to easily express the parallelism of their programs, and to have the 
runtime system take care of the details of running the program on many processors. 
Cilk’s scheduler uses a work-stealing algorithm to achieve provably good performance. 
While this feature is not the main focus of this thesis, it surfaces again as motivation
for the radiosity example.

¹Some of the results in this chapter appear in [6].

In order to make it as easy as possible for programmers to express parallelism,
Cilk was designed as a simple extension to C. A Cilk program is a C program with 
a few keywords added. Furthermore, a Cilk program running on one processor has 
the same semantics as the C program that is left when those keywords are removed. 
Cilk does not require programmers to know a priori on how many processors their 
programs will run. 

The Cilk keyword spawn, when immediately preceding a function call, declares 
that the function may be run in parallel. In other words, the parent function that 
spawned the child is allowed to continue executing at the same time as the child 
function executes. The parent may later issue a sync instruction, which means that 
the parent must wait until all the children it has spawned complete before continuing.²
Any procedure that spawns other procedures or that itself is spawned must be declared 
with the type qualifier cilk. 

Figure 2-1 gives an example Cilk procedure that computes the nth Fibonacci num- 
ber. The two recursive cases of the Fibonacci calculation are spawned off in parallel. 
The code then syncs, which forces it to wait for the two spawned subcomputations 
to complete. Once they have done so, their results are available to be accumulated 
and returned. 

Additionally, Cilk provides the user with mutual-exclusion locks. A lock is es- 
sentially a location in shared memory that can be “acquired” or “released.” It is 
guaranteed, however, that at most one thread can acquire a given lock at once. The 
command Cilk_lock() acquires a specified lock, and Cilk_unlock() releases a spec- 
ified lock. If the lock is already acquired then Cilk_lock() “spins,” meaning that it 
waits until the lock is released, and then attempts to acquire it again. We assume in 
this thesis, as does the race-detection literature, that parallel control constructs are 
disallowed while locks are held.³

²The semantics of spawn and sync are similar to those of fork/join, but spawn and sync are
lightweight operations.

³The Nondeterminator-2 can still be used with programs for which this assumption does not
hold, but the race detector prints a warning, and some races may be missed. We are developing
extensions of the Nondeterminator-2’s detection algorithms that work properly for programs that
hold locks across parallel control constructs. See [5] for more discussion.
    cilk int fib(int n)
    {
      int x;
      int y;

      if (n < 2)
        return n;

      x = spawn fib(n-1);
      y = spawn fib(n-2);

      sync;

      return (x+y);
    }


Figure 2-1: A Cilk procedure that computes the nth Fibonacci number. 
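
The spawn and sync structure of this procedure can be mimicked with ordinary threads. The sketch below is an analogy only: it assumes that starting a Python thread plays the role of spawn and that joining plays the role of sync, and it has none of the efficiency of Cilk's work-stealing scheduler (one heavyweight thread is created per spawn, so it is suitable only for small n). The names `fib_serial` and `child` are hypothetical.

```python
import threading

def fib_serial(n):
    """The serial elision: the same procedure with spawn and sync removed."""
    if n < 2:
        return n
    return fib_serial(n - 1) + fib_serial(n - 2)

def fib(n):
    if n < 2:
        return n
    result = {}

    def child(name, arg):
        result[name] = fib(arg)

    x = threading.Thread(target=child, args=("x", n - 1))
    y = threading.Thread(target=child, args=("y", n - 2))
    x.start()                 # spawn fib(n-1)
    y.start()                 # spawn fib(n-2)
    x.join()                  # sync: wait for both children
    y.join()
    return result["x"] + result["y"]

print(fib(10), fib_serial(10))   # 55 55
```

The parallel version and its serial elision compute the same value, which mirrors the property that a Cilk program run on one processor behaves like the C program obtained by deleting the Cilk keywords.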


The computation of a Cilk program on a given input can be viewed as a directed 
acyclic graph, or dag, in which vertices are instructions and edges denote ordering 
constraints imposed by control statements. A Cilk spawn statement generates a vertex 
with out-degree 2, and a Cilk sync statement generates a vertex whose in-degree is 1
plus the number of subprocedures syncing at that point.


We define a thread to be a maximal sequence of vertices that does not contain
any parallel control constructs. If there is a path in the dag from thread $e_1$ to thread
$e_2$, then we say that the threads are logically in series, which we denote by $e_1 \prec e_2$.
If there is no path in the dag between $e_1$ and $e_2$, then they are logically in parallel,
$e_1 \parallel e_2$. Only the series relation $\prec$ is transitive. A dag race exists on a Cilk
computation if two threads $e_1 \parallel e_2$ access the same memory location while holding
no locks in common, and at least one of the threads writes the location.


The computation dag generated by a Cilk program can itself be represented as 
a binary series-parallel parse tree, as illustrated in Figure 2-2. In the parse 
tree of a Cilk computation, leaf nodes represent threads. Each internal node is ei- 
ther an S-node if the computation represented by its left subtree logically precedes 
the computation represented by its right subtree, or a P-node if its two subtrees’
computations are logically in parallel.


    int x;
    Cilk_lockvar A, B;

    cilk void foo1() {
      Cilk_lock(A);
      Cilk_lock(B);
      x += 5;
      Cilk_unlock(B);
      Cilk_unlock(A);
    }

    cilk void foo2() {
      Cilk_lock(A);
      x -= 3;
      Cilk_unlock(A);
    }

    cilk void foo3() {
      Cilk_lock(B);
      x++;
      Cilk_unlock(B);
    }

    cilk int main() {
      Cilk_lock_init(A);
      Cilk_lock_init(B);
      x = 0;
      spawn foo1();
      spawn foo2();
      spawn foo3();
      sync;
      printf("%d", x);
    }

[Parse tree: the leaves, from left to right, are x=0 with lock set {}, x+=5 with {A,B}, x-=3 with {A}, x++ with {B}, and printf("%d", x) with {}. P-nodes join the three spawned accesses, and S-nodes place x=0 before them and the printf after them.]


Figure 2-2: A Cilk program and the associated series-parallel parse tree, abbreviated to
show only the accesses to shared location x. Each leaf is labeled with a code fragment that
accesses x, with the set of locks held at that access shown above the code fragment.


A parse tree allows the series/parallel relation between two threads $e_1$ and $e_2$
to be determined by examining their least common ancestor, which we denote by
$\mathrm{LCA}(e_1, e_2)$. If $\mathrm{LCA}(e_1, e_2)$ is a P-node, the two threads are logically in parallel
($e_1 \parallel e_2$). If $\mathrm{LCA}(e_1, e_2)$ is an S-node, the two threads are logically in series: $e_1 \prec e_2$,
assuming that $e_1$ precedes $e_2$ in a left-to-right depth-first tree walk of the parse tree.
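
The least-common-ancestor rule can be sketched directly on a small parse tree. The example below is an illustration under stated assumptions: the tree is encoded as nested tuples `(kind, left, right)` with strings as leaves, the two queried leaves are distinct and both present, and the particular nesting of the P-nodes for the program of Figure 2-2 is a guess, since any binary nesting of the three spawned accesses gives the same relations. The names `leaves` and `relation` are hypothetical.

```python
def leaves(node):
    """Return the leaves (threads) of a parse tree in left-to-right order."""
    if isinstance(node, str):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)

def relation(tree, e1, e2):
    """Return 'series' or 'parallel' for two distinct leaves of the parse tree."""
    kind, left, right = tree
    in_left = set(leaves(left))
    in_right = set(leaves(right))
    if e1 in in_left and e2 in in_left:
        return relation(left, e1, e2)
    if e1 in in_right and e2 in in_right:
        return relation(right, e1, e2)
    # e1 and e2 fall in different subtrees, so this node is their LCA.
    return "parallel" if kind == "P" else "series"

# Assumed shape of the parse tree for the program in Figure 2-2.
tree = ("S", "x=0",
        ("S", ("P", ("P", "x+=5", "x-=3"), "x++"), "printf"))

print(relation(tree, "x+=5", "x++"))     # parallel
print(relation(tree, "x=0", "x-=3"))     # series
print(relation(tree, "x-=3", "printf"))  # series
```

This version recomputes leaf sets at every level and is therefore far slower than the SP-bags approach; it is meant only to make the LCA rule concrete.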


The original Nondeterminator 


The original Nondeterminator uses the efficient SP-BAGS algorithm to detect dag 
races in Cilk programs that do not use locks. The SP-BAGS algorithm executes a Cilk 
program on a given input in serial, depth-first order. This execution order mirrors 
that of normal C programs: every subcomputation that is spawned executes
completely before the procedure that spawned it continues. Every spawned procedure⁴
is given a unique ID at runtime. These IDs are kept in the fast disjoint-set data 
structure [8, Chapter 22] analyzed by Tarjan [43]. The data structure maintains a 


dynamic collection © of disjoint sets and provides three elementary operations: 
Make-Set(z): H+ UU {{a}}. 


¹ Technically, by “procedure” we mean “procedure instance,” that is, the runtime
state of the procedure.
SPAWN procedure F:
    S_F ← MAKE-SET(F)
    P_F ← {}

SYNC in a procedure F:
    S_F ← UNION(S_F, P_F)

RETURN from procedure F' to F:
    P_F ← UNION(P_F, S_F')

Figure 2-3: The SP-BAGS algorithm for updating S-bags and P-bags, which are
represented as disjoint sets.


    UNION(X, Y): C ← C − {X, Y} ∪ {X ∪ Y}. The sets X and Y are destroyed.
    FIND-SET(x): Returns the set X ∈ C such that x ∈ X.

Tarjan shows that any m of these operations on n sets take a total of
O(m α(m, n)) time.

During the execution of the SP-BAGS algorithm, two “bags” of procedure IDs are
maintained for every Cilk procedure on the call stack. These bags have the
following contents:

  • The S-bag S_F of a procedure F contains the IDs of those descendants of F's
    completed children that logically “precede” the currently executing thread,
    as well as the ID for F itself.

  • The P-bag P_F of a procedure F contains the IDs of those descendants of F's
    completed children that operate logically “in parallel” with the currently
    executing thread.


The S-bags and P-bags are represented as sets using the disjoint-set data
structure. At each parallel control construct of the program, the contents of
the bags are updated as described in Figure 2-3. Determining the logical
relationship of the currently executing thread with any already executed thread
requires only a FIND-SET operation, which runs in amortized α(V,V) time. If the
set found is an S-bag, the threads are in series, whereas if a P-bag is found,
the threads are in parallel.

In addition, SP-BAGS maintains a shadow space that has an entry corresponding
to each location of shared memory. For a location l of shared memory, the
corresponding shadow space entry keeps information about previous accesses to l.
This information is used to find previous threads that have accessed the same
location as the current thread.
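
The bag updates of Figure 2-3 can be sketched as follows. This is a simplified illustration, not the Nondeterminator's implementation: the class and method names are hypothetical, union by rank is omitted (so the α bound is not achieved), an empty P-bag is modeled as `None`, and the sketch assumes that a procedure's P-bag is emptied at a sync and that every procedure syncs before it returns.

```python
class DisjointSets:
    """Disjoint sets of procedure IDs; each root is tagged 'S' or 'P'."""
    def __init__(self):
        self.parent = {}
        self.kind = {}

    def make_set(self, x, kind):
        self.parent[x] = x
        self.kind[x] = kind
        return x

    def find_set(self, x):
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:            # path compression
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b, kind):
        ra, rb = self.find_set(a), self.find_set(b)
        if ra != rb:
            self.parent[rb] = ra
        self.kind[ra] = kind                     # tag the merged bag
        return ra

class SPBags:
    def __init__(self):
        self.sets = DisjointSets()
        self.sbag = {}    # procedure -> an element of its S-bag
        self.pbag = {}    # procedure -> an element of its P-bag, or None if empty

    def spawn(self, f):
        self.sbag[f] = self.sets.make_set(f, "S")
        self.pbag[f] = None

    def sync(self, f):
        if self.pbag[f] is not None:
            self.sbag[f] = self.sets.union(self.sbag[f], self.pbag[f], "S")
            self.pbag[f] = None                  # assumption: P-bag emptied at sync

    def ret(self, child, f):
        """RETURN from `child` to `f`: the child's S-bag joins f's P-bag."""
        if self.pbag[f] is None:
            root = self.sets.find_set(self.sbag[child])
            self.sets.kind[root] = "P"
            self.pbag[f] = root
        else:
            self.pbag[f] = self.sets.union(self.pbag[f], self.sbag[child], "P")

    def in_parallel(self, proc):
        """Is already-executed procedure `proc` parallel to the current thread?"""
        return self.sets.kind[self.sets.find_set(proc)] == "P"

# Serial depth-first execution of main spawning foo1, foo2, foo3, then sync.
sp = SPBags()
sp.spawn("main")
seen = {}
for child in ["foo1", "foo2", "foo3"]:
    sp.spawn(child)
    earlier = [p for p in sp.sbag if p != child]
    seen[child] = {p: sp.in_parallel(p) for p in earlier}   # queried while child runs
    sp.sync(child)
    sp.ret(child, "main")
sp.sync("main")
foo1_parallel_after_sync = sp.in_parallel("foo1")
```

While `foo3` runs, `foo1` and `foo2` sit in the P-bag of `main` and are therefore reported as parallel; after the sync they move to the S-bag.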


The All-Sets algorithm 


The ALL-SETS algorithm is an extension of the SP-BAGS algorithm that detects dag
races in Cilk programs that use locks. The ALL-SETS algorithm also uses S-bags
and P-bags to determine the series/parallel relationship between threads. Its
shadow space, lockers, is more complex than the shadow space of SP-BAGS,
however, because it keeps track of which locks were held by previous accesses to
the various locations.

The lock set of an access is the set of locks held by the thread when the access 
occurs. The lock set of several accesses is the intersection of their respective lock 
sets. If the lock set of two parallel accesses to the same location is empty, and at least 
one of the accesses is a WRITE, then a dag race exists. To simplify the description 
and analysis of the race detection algorithm, we shall use a small trick to avoid the 
extra condition for a race that “at least one of the accesses is a WRITE.” The idea 
is to introduce a fake lock for read accesses called the R-LOCK, which is implicitly 
acquired immediately before a READ and released immediately afterwards. The fake 
lock behaves from the race detector’s point of view just like a normal lock, but during 
an actual computation, it is never actually acquired and released (since it does not 
actually exist). The use of R-LOCK simplifies the description and analysis of ALL- 
SETS, because it allows us to state the condition for a dag race more succinctly: if 
the lock set of two parallel accesses to the same location is empty, then a dag race 
exists. By this condition, a dag race (correctly) does not exist for two read accesses, 
since their lock set contains the R-LOCK. 

The entry lockers[l] in ALL-SETS' shadow space stores a list of lockers: threads
that access location l, each paired with the lock set that was held during the
access.
LOCK(A)
    Add A to H

UNLOCK(A)
    Remove A from H

ACCESS(l) in thread e with lock set H
 1  for each (e', H') ∈ lockers[l]
 2      do if e' || e and H' ∩ H = {}
 3            then declare a dag race
 4  redundant ← FALSE
 5  for each (e', H') ∈ lockers[l]
 6      do if e' ≺ e and H' ⊇ H
 7            then lockers[l] ← lockers[l] − {(e', H')}
 8         if e' || e and H' ⊆ H
 9            then redundant ← TRUE
10  if redundant = FALSE
11      then lockers[l] ← lockers[l] ∪ {(e, H)}

Figure 2-4: The ALL-SETS algorithm. The operations for the spawn, sync, and
return actions are unchanged from the SP-BAGS algorithm.


If (e, H) ∈ lockers[l], then thread e accesses location l while holding the
lock set H.

As an example of what the shadow space lockers may contain, consider a thread e
that performs the following:

    Cilk_lock(A); Cilk_lock(B);
        READ(l)
    Cilk_unlock(B); Cilk_unlock(A);
    Cilk_lock(B); Cilk_lock(C);
        WRITE(l)
    Cilk_unlock(C); Cilk_unlock(B);

For this example, the list lockers[l] contains two lockers:
(e, {A, B, R-LOCK}) and (e, {B, C}).

The ALL-SETS algorithm is shown in Figure 2-4. Intuitively, this algorithm
records all lockers, but it is careful to prune redundant lockers, keeping at
most one locker per distinct lock set. Locks are added and removed from the
global lock set H at Cilk_lock and Cilk_unlock statements. Lines 1-3 check to
see if a dag race has occurred and report any violations. Lines 5-11 then add
the current locker to the lockers shadow space and prune redundant lockers.
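
The ACCESS routine of Figure 2-4 can be sketched as below. The sketch makes several assumptions that are not part of the original: the series/parallel test is supplied as a caller-provided `parallel(e1, e2)` function standing in for the SP-bags query, threads and locks are plain strings, and the names `AllSets` and `R_LOCK` are hypothetical. Because threads are processed in serial execution order, an earlier thread that is not parallel to the current one is taken to precede it.

```python
R_LOCK = "R-LOCK"   # fake lock implicitly held by every READ

class AllSets:
    def __init__(self, parallel):
        self.parallel = parallel   # parallel(e_prev, e_cur) -> bool
        self.lockers = {}          # location -> list of (thread, frozenset of locks)
        self.races = []            # (location, earlier thread, current thread)

    def access(self, loc, e, held, is_read=False):
        H = frozenset(held) | (frozenset({R_LOCK}) if is_read else frozenset())
        entries = self.lockers.setdefault(loc, [])

        # Lines 1-3: a parallel earlier access with a disjoint lock set is a race.
        for e_prev, H_prev in entries:
            if self.parallel(e_prev, e) and not (H_prev & H):
                self.races.append((loc, e_prev, e))

        # Lines 4-11: prune lockers made redundant by (e, H), then add (e, H)
        # unless an existing parallel locker with a smaller lock set covers it.
        redundant = False
        kept = []
        for e_prev, H_prev in entries:
            par = self.parallel(e_prev, e)
            if not par and H_prev >= H:
                continue                       # line 7: remove (e_prev, H_prev)
            if par and H_prev <= H:
                redundant = True               # line 9
            kept.append((e_prev, H_prev))
        if not redundant:
            kept.append((e, H))                # line 11
        self.lockers[loc] = kept

# The program of Figure 2-2: the three spawned procedures are pairwise parallel.
spawned = {"foo1", "foo2", "foo3"}
def parallel(a, b):
    return a != b and a in spawned and b in spawned

d = AllSets(parallel)
d.access("x", "main:init", [])
d.access("x", "foo1", ["A", "B"])
d.access("x", "foo2", ["A"])
d.access("x", "foo3", ["B"])
d.access("x", "main:print", [], is_read=True)

# The single-thread example from the text: READ under {A,B}, WRITE under {B,C}.
t = AllSets(lambda a, b: False)
t.access("l", "e", ["A", "B"], is_read=True)
t.access("l", "e", ["B", "C"])
```

On the Figure 2-2 program the sketch reports one race, between `foo2` (holding only A) and `foo3` (holding only B); `foo1` holds both locks and so races with neither.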


Correctness of All-Sets 
Before proving the correctness of ALL-SETS, we restate two lemmas from [14]. 


Lemma 1 Suppose that three threads e1, e2, and e3 execute in order in a serial,
depth-first execution of a Cilk program, and suppose that e1 ≺ e2 and e1 || e3.
Then, we have e2 || e3.

Lemma 2 (Pseudotransitivity of ||) Suppose that three threads e1, e2, and e3
execute in order in a serial, depth-first execution of a Cilk program, and
suppose that e1 || e2 and e2 || e3. Then, we have e1 || e3.

We now prove that the ALL-SETS algorithm is correct.


Theorem 3 The ALL-SETS algorithm detects a dag race in a computation of a Cilk 


program running on a given input if and only if a dag race exists in the computation. 


Proof: (⇒) To prove that any race reported by the ALL-SETS algorithm really
exists in the computation, observe that every locker added to lockers[l] in
line 11 consists of a thread and the lock set held by that thread when it
accesses l. The algorithm declares a race when it detects in line 2 that the
lock set of two parallel accesses (by the current thread e and one from
lockers[l]) is empty, which is exactly the condition required for a dag race.

(⇐) Assuming a dag race exists in a computation, we shall show that a dag race
is reported. If a dag race exists, then we can choose two threads e1 and e2 such
that e1 is the last thread before e2 in the serial execution that has a dag race
with e2. If we let H1 and H2 be the lock sets held by e1 and e2, respectively,
then we have e1 || e2 and H1 ∩ H2 = {}.

We first show that immediately after e1 executes, lockers[l] contains some
thread e3 that races with e2. If (e1, H1) is added to lockers[l] in line 11,
then e1 is such an e3. Otherwise, the redundant flag must have been set in
line 9, so there must exist a locker (e3, H3) ∈ lockers[l] with e3 || e1 and
H3 ⊆ H1. Thus, by pseudotransitivity (Lemma 2), we have e3 || e2. Moreover,
since H3 ⊆ H1 and H1 ∩ H2 = {}, we have H3 ∩ H2 = {}, and therefore e3, which
belongs to lockers[l], races with e2.

To complete the proof, we now show that the locker (e3, H3) is not removed from
lockers[l] between the times that e1 and e2 are executed. Suppose to the
contrary that (e4, H4) is a locker that causes (e3, H3) to be removed from
lockers[l] in line 7. Then, we must have e3 ≺ e4 and H3 ⊇ H4, and by Lemma 1,
we have e4 || e2. Moreover, since H3 ⊇ H4 and H3 ∩ H2 = {}, we have
H4 ∩ H2 = {}, contradicting the choice of e1 as the last thread before e2 to
race with e2.

Therefore, thread e3, which races with e2, still belongs to lockers[l] when e2
executes, and so lines 1-3 report a race.


Analysis of All-Sets 


In Chapter 1, we claimed that for a Cilk program that executes in time T on one
processor, references V shared memory locations, uses a total of n locks, and
holds at most k ≪ n locks simultaneously, the ALL-SETS algorithm can check this
computation for dag races in O(n^k T α(V,V)) time and using O(n^k V) space.
These bounds, which are correct but weak, are improved by the next theorem.

Theorem 4 Consider a Cilk program that executes in time T on one processor,
references V shared memory locations, uses a total of n locks, and holds at most
k locks simultaneously. The ALL-SETS algorithm checks this computation for dag
races in O(TL(k + α(V,V))) time and O(kLV) space, where L is the maximum of the
number of distinct lock sets used to access any particular location.


Proof: First, observe that no two lockers in lockers[l] have the same lock set,
because the logic in lines 5-11 ensures that if H = H', then locker (e, H)
either replaces (e', H') (line 7) or is considered redundant (line 9). Thus,
there are at most L lockers in the list lockers[l]. Each lock set takes at most
O(k) space, so the space needed for lockers is O(kLV). The length of the list
lockers[l] determines the number of series/parallel relations that are tested.
In the worst case, we need to perform 2L such tests (lines 2 and 6) and 2L set
operations (lines 2, 6, and 8) per access. Each series/parallel test takes
amortized O(α(V,V)) time, and each set operation takes O(k) time. Therefore,
the ALL-SETS algorithm runs in O(TL(k + α(V,V))) time.


The looser bounds claimed in Chapter 1 of O(n^k T α(V,V)) time and O(n^k V)
space for k ≪ n follow because $L \le \sum_{i=0}^{k} \binom{n}{i} = O(n^k / k!)$.
As we shall see in Chapter 9, however, we rarely see the worst-case behavior
given by the bounds in Theorem 4.
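
As a small numeric illustration of the bound on L, the following sketch counts the lock sets of size at most k that can be drawn from n locks. The function name is hypothetical, and the count ignores the fake R-LOCK.

```python
from math import comb

def max_lock_sets(n, k):
    """Number of subsets of n locks with at most k elements: sum of C(n, i), i = 0..k."""
    return sum(comb(n, i) for i in range(k + 1))

small = max_lock_sets(4, 2)     # 1 + 4 + 6
larger = max_lock_sets(10, 2)   # 1 + 10 + 45
```

For n = 4 and k = 2 the count is 11, and for n = 10 and k = 2 it is 56, so the worst case grows quickly with n even for small k.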
Chapter 3


The Brelly Algorithm 


In this chapter, we formally define the “umbrella locking discipline” and
present the BRELLY algorithm for detecting violations of this discipline.¹ We
prove that the BRELLY algorithm is correct and analyze its performance, which we
show to be asymptotically better than that of ALL-SETS. Specifically, we show
that for a program that runs serially in time T, accesses V shared memory
locations, uses a total of n locks, and holds at most k ≪ n locks
simultaneously, BRELLY runs in O(kT α(V,V)) time using O(kV) space, where α is
Tarjan's functional inverse of Ackermann's function. We further prove that
BRELLY is guaranteed to find a violation of the umbrella discipline in the
computation if and only if a violation exists.


The umbrella discipline 


The umbrella discipline can be defined precisely in terms of the parse tree of a
given Cilk computation. An umbrella of accesses to a location l is a subtree
rooted at a P-node containing accesses to l in both its left and right subtrees,
as is illustrated in Figure 3-1. An umbrella of accesses to l is protected if
its accesses have a nonempty lock set and unprotected otherwise. A program obeys
the umbrella locking discipline if it contains no unprotected umbrellas. In
other words, within each umbrella of accesses to a location l, all threads must
agree on at least one lock to protect their accesses to l.

Figure 3-1: Three umbrellas of accesses to a location l. In this parse tree,
each shaded leaf represents a thread that accesses l. Each umbrella of accesses
to l is enclosed by a dashed line.

¹ Some of the results in this chapter appear in [6].


The next theorem shows that adherence to the umbrella discipline precludes dag 


races from occurring. 
Theorem 5 A Cilk computation with a dag race violates the umbrella discipline. 


Proof: Any two threads involved in a dag race must have a P-node as their least
common ancestor in the parse tree, because they operate in parallel. This P-node
roots an unprotected umbrella, since both threads access the same location and
the lock sets of the two threads are disjoint.


The umbrella discipline can also be violated by unusual, but dag-race free, locking 
protocols. For instance, suppose that a location is protected by three locks and that 
every thread always acquires two of the three locks before accessing the location. 
No single lock protects the location, but every pair of such accesses is mutually 
exclusive. The ALL-SETS algorithm properly certifies this bizarre example as race- 
free, whereas BRELLY detects a discipline violation. In return for disallowing these 
unusual locking protocols (which in any event are of dubious value), BRELLY checks 


programs asymptotically faster than ALL-SETS. 
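
The three-lock protocol can be checked directly against the definition of an umbrella. The sketch below is a brute-force illustration and not the BRELLY algorithm: it walks an explicit parse tree, encoded as nested tuples, and intersects the lock sets under each P-node. The tuple encoding, the function names, and the particular tree shape are assumptions, and all accesses are treated as writes.

```python
from itertools import combinations

def leaf(name, locks):
    """A thread; `locks` is the lock set of its access to l, or None if no access."""
    return ("leaf", name, None if locks is None else frozenset(locks))

def accesses(node):
    """Lock sets of all accesses to l in the subtree rooted at `node`."""
    if node[0] == "leaf":
        return [] if node[2] is None else [node[2]]
    return accesses(node[1]) + accesses(node[2])

def unprotected_umbrellas(node):
    """P-nodes with accesses on both sides whose common lock set is empty."""
    if node[0] == "leaf":
        return []
    found = unprotected_umbrellas(node[1]) + unprotected_umbrellas(node[2])
    left, right = accesses(node[1]), accesses(node[2])
    if node[0] == "P" and left and right:
        if not frozenset.intersection(*(left + right)):
            found.append(node)
    return found

# Three parallel threads, each holding two of the three locks A, B, C.
e1 = leaf("e1", {"A", "B"})
e2 = leaf("e2", {"B", "C"})
e3 = leaf("e3", {"A", "C"})
inner = ("P", e2, e3)
tree = ("P", e1, inner)

violations = unprotected_umbrellas(tree)
pairwise_disjoint = [
    (a[1], b[1]) for a, b in combinations([e1, e2, e3], 2) if not (a[2] & b[2])
]
```

Every pair of threads shares a lock, so there is no dag race, yet the umbrella rooted at the outer P-node has an empty lock set and therefore violates the discipline.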


The Brelly algorithm 


Like ALL-SETS, the BRELLY algorithm extends the SP-BAGS algorithm used in the
original Nondeterminator and uses the R-LOCK fake lock for read accesses (see
Chapter 2). Figure 3-2 gives pseudocode for BRELLY. Like the SP-BAGS algorithm,
LOCK(A)
    Add A to H

UNLOCK(A)
    Remove A from H

ACCESS(l) in thread e with lock set H
 1  if accessor[l] ≺ e
 2     then ▷ serial access
            locks[l] ← H, leaving nonlocker[h] with its old
                nonlocker if it was already in locks[l] but
                setting nonlocker[h] ← accessor[l] otherwise
 3          for each lock h ∈ locks[l]
 4              do alive[h] ← TRUE
 5          accessor[l] ← e
 6     else ▷ parallel access
 7          for each lock h ∈ locks[l] − H
 8              do if alive[h] = TRUE
 9                    then alive[h] ← FALSE
10                         nonlocker[h] ← e
11          for each lock h ∈ locks[l] ∩ H
12              do if alive[h] = TRUE and nonlocker[h] || e
13                    then alive[h] ← FALSE
14          if no locks in locks[l] are alive (or locks[l] = {})
15             then report violation on l involving
                        e and accessor[l]
16                  for each lock h ∈ H ∩ locks[l]
17                      do report access to l without h
                               by nonlocker[h]



Figure 3-2: The BRELLY algorithm. While executing a Cilk program in serial depth-first 
order, at each access to a shared-memory location |, the code for ACCESS(/) is executed. 
Locks are added and removed from the lock set H at Cilk_lock and Cilk_unlock state- 
ments. To determine whether the currently executing thread is in series or parallel with 
previously executed threads, BRELLY uses the SP-bags data structure. 
BRELLY executes the program on a given input in serial depth-first order,
maintaining the SP-bags data structure so that the series/parallel relationship
between the currently executing thread and any previously executed thread can be
determined quickly. Like the ALL-SETS algorithm, BRELLY also maintains a set H
of currently held locks. In addition, BRELLY maintains two shadow spaces of
shared memory: accessor, which stores for each location the thread that
performed the last “serial access” to that location; and locks, which stores the
lock set of that access. Each entry in the accessor space is initialized to the
initial thread (which logically precedes all threads in the computation), and
each entry in the locks space is initialized to the empty set.


Unlike the ALL-SETS algorithm, BRELLY keeps only a single lock set, rather than
a list of lock sets, for each shared-memory location. For a location l, each
lock in locks[l] potentially belongs to the lock set of the largest umbrella of
accesses to l that includes the current thread. The BRELLY algorithm tags each
lock h ∈ locks[l] with two pieces of information: a thread nonlocker[h] and a
flag alive[h]. The thread nonlocker[h] is a thread that accesses l without
holding h. The flag alive[h] indicates whether h should still be considered to
potentially belong to the lock set of the umbrella. To allow reports of
violations to be more precise, the algorithm “kills” a lock h by setting
alive[h] ← FALSE when it determines that h does not belong to the lock set of
the umbrella, rather than simply removing it from locks[l].


Whenever BRELLY encounters an access by a thread e to a location l, it checks
for a violation with previous accesses to l, updating the shadow spaces
appropriately for future reference. If accessor[l] ≺ e, we say the access is a
serial access, and the algorithm performs lines 2-5, setting locks[l] ← H and
accessor[l] ← e, as well as updating nonlocker[h] and alive[h] appropriately for
each h ∈ H. If accessor[l] || e, we say the access is a parallel access, and the
algorithm performs lines 6-17, killing the locks in locks[l] that do not belong
to the current lock set H (lines 7-10) or whose nonlockers are in parallel with
the current thread (lines 11-13). If BRELLY finds in line 14 that there are no
locks left alive in locks[l] after a parallel access, it has found an
unprotected umbrella, and it then reports a discipline violation in lines 15-17.
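
The ACCESS routine of Figure 3-2 can be sketched as follows. As a simplification, the series/parallel test is again supplied as a caller-provided `parallel(e_prev, e_cur)` function in place of the SP-bags structure, the R-LOCK is not modeled (all accesses are treated as writes), and the class name, the `"init"` thread name, and the shape of the violation records are hypothetical.

```python
class Brelly:
    def __init__(self, parallel, initial="init"):
        self.parallel = parallel   # parallel(e_prev, e_cur) -> bool
        self.initial = initial     # initial thread, which precedes every thread
        self.accessor = {}         # location -> thread of the last serial access
        self.locks = {}            # location -> {lock: [nonlocker, alive]}
        self.violations = []       # (location, thread, accessor, {lock: nonlocker})

    def access(self, loc, e, held):
        H = set(held)
        acc = self.accessor.get(loc, self.initial)
        locks = self.locks.setdefault(loc, {})

        if not self.parallel(acc, e):
            # Lines 2-5 (serial access): locks[l] <- H with every lock alive.
            # A lock already present keeps its old nonlocker; a new lock gets
            # the previous accessor as its nonlocker.
            self.locks[loc] = {
                h: [locks[h][0] if h in locks else acc, True] for h in H
            }
            self.accessor[loc] = e
            return

        # Lines 7-13 (parallel access): kill locks that e does not hold, and
        # locks whose nonlocker is parallel to e.
        for h, tag in locks.items():
            if h not in H:
                if tag[1]:
                    tag[0], tag[1] = e, False
            elif tag[1] and self.parallel(tag[0], e):
                tag[1] = False

        # Lines 14-17: no lock left alive means an unprotected umbrella.
        if not any(alive for _, alive in locks.values()):
            without = {h: locks[h][0] for h in H if h in locks}
            self.violations.append((loc, e, acc, without))

threads = {"e1", "e2", "e3"}
def parallel(a, b):
    return a != b and a in threads and b in threads

# The three-lock protocol: each parallel thread holds two of A, B, C.
b = Brelly(parallel)
b.access("l", "e1", ["A", "B"])
b.access("l", "e2", ["B", "C"])
b.access("l", "e3", ["A", "C"])

# A protected variant: every thread holds A.
ok = Brelly(parallel)
ok.access("l", "e1", ["A", "B"])
ok.access("l", "e2", ["A", "C"])
ok.access("l", "e3", ["A"])
```

In the first run, `e2` kills A and `e3` kills B, so the access by `e3` leaves no lock alive and a violation is reported naming `e2` as a thread that accessed the location without A. In the second run lock A stays alive throughout and nothing is reported.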
Figure 3-3: A sample execution of the BRELLY algorithm. We restrict our
attention to the algorithm's operation on a single location l. In the parse
tree, each leaf represents an access to l and is labeled with the thread that
performs the access (e.g., e1) and the lock set of that access (e.g., {A,B}).
Umbrellas are enclosed by dashed lines. The table displays the values of
accessor[l] and locks[l] after each thread's access. The nonlocker for each lock
is given in parentheses after the lock, and killed locks are underlined. The
“access type” column indicates whether the access is a parallel or serial
access. A discipline violation is reported after the execution of e7, because e7
is a parallel access and no locks are left alive in locks[l].


When reporting a violation, BRELLY specifies the location l, the current thread
e, and the thread accessor[l]. It may be that e and accessor[l] hold locks in
common, in which case the algorithm uses the nonlocker information in lines
16-17 to report threads that accessed l without each of these locks. Thus, every
violation message printed by the algorithm contains enough information to show
that the umbrella in question is in fact unprotected.


Figure 3-3 illustrates how BRELLY works. The umbrella containing threads e1,
e2, and e3 is protected by lock A but not by lock B, which is reflected in
locks[l] after thread e3 executes. The umbrella containing e5 and e6 is
protected by B but not by A, which is reflected in locks[l] after thread e6
executes. During the execution of thread e6, A is killed and nonlocker[A] is set
to e6, according to the logic in lines 7-10. When e7 executes, B remains as the
only lock alive in locks[l] and nonlocker[B] is e4 (due to line 2 during e5's
execution). Since e4 || e7, lines 11-13 kill B, leaving no locks alive in
locks[l], properly reflecting the fact that no lock protects the umbrella
containing threads e4 through e7. Consequently, the test in line 14 causes
BRELLY to declare a violation at this point.


Correctness of Brelly 


The following two lemmas will be helpful in proving the correctness of BRELLY.

Lemma 6 Suppose a thread e performs a serial access to location l during an
execution of BRELLY. Then all previously executed accesses to l logically
precede e in the computation.

Proof: By transitivity of the ≺ relation, all serial accesses to l that execute
before e logically precede e. We must also show the same for all parallel
accesses to l that are executed before e. Now, consider a thread e' that
performs a parallel access to l before e executes, and let e'' || e' be the
thread stored in accessor[l] when e' executes its parallel access. Since e'' is
a serial access to l that executes before e, we have e'' ≺ e. Consequently, we
must have e' ≺ e, because otherwise, by pseudotransitivity (Lemma 2) we would
have e'' || e, a contradiction.


Lemma 7 The BRELLY algorithm maintains the invariant that for any location l
and lock h ∈ locks[l], the thread nonlocker[h] is either the initial thread or a
thread that accessed l without holding h.

Proof: There are two cases in which nonlocker[h] is updated. The first is in the
assignment nonlocker[h] ← e in line 10. This update only occurs when the current
thread e does not hold lock h (line 7). The second case is when a lock's
nonlocker[h] is set to accessor[l] in line 2. If this update occurs during the
first access to l in the program, then accessor[l] is the initial thread.
Otherwise, locks[l] is the set of locks held during an access to l in
accessor[l], since locks[l] and accessor[l] are updated together to the current
lock set H and current thread e, respectively, during a serial access (lines
2-5), and neither is updated anywhere else. Thus, if h ∉ locks[l], which is the
case if nonlocker[h] is being set to accessor[l] in line 2, then accessor[l] did
not hold lock h during its access to l.


Theorem 8 The BRELLY algorithm detects a violation of the umbrella discipline in 
a computation of a Cilk program running on a given input if and only if a violation 


exists. 


Proof: We first show that BRELLY only detects actual violations of the
discipline, and then we argue that no violations are missed. In this proof, we
denote by locks*[l] the set of locks in locks[l] that have TRUE alive flags.

(⇒) Suppose that BRELLY detects a violation caused by a thread e, and let
e0 = accessor[l] when e executes. Since we have e0 || e, it follows that
p = LCA(e0, e) roots an umbrella of accesses to l, because p is a P-node and it
has an access to l in both subtrees. We shall argue that the lock set U of the
umbrella rooted at p is empty. Since BRELLY only reports violations when
locks*[l] = {}, it suffices to show that U ⊆ locks*[l] at all times after e0
executes.

Since e0 is a serial access, lines 2-5 cause locks*[l] to be the lock set of e0.
At this point, we know that U ⊆ locks*[l], because U can only contain locks held
by every access in p's subtree. Suppose that a lock h is killed (and thus
removed from locks*[l]), either in line 9 or line 13, when some thread e'
executes a parallel access between the times that e0 and e execute. We shall
show that in both cases h ∉ U, and so U ⊆ locks*[l] is maintained.

In the first case, if thread e' kills h in line 9, it does not hold h, and thus
h ∉ U.

In the second case, we shall show that w, the thread stored in nonlocker[h] when
h is killed, is a descendant of p, which implies that h ∉ U, because by Lemma 7,
w accesses l without the lock h. Assume for the purpose of contradiction that w
is not a descendant of p. Then, we have LCA(w, e0) = LCA(w, e'), which implies
that w || e0, because w || e'. Now, consider whether nonlocker[h] was set to w
in line 10 or in line 2 (not counting when nonlocker[h] is left with its old
value in line 2). If line 10 sets nonlocker[h] ← w, then w must execute before
e0, since otherwise, w would be a parallel access, and lock h would have been
killed in line 9 by w before e' executes. By Lemma 6, we therefore have the
contradiction that w ≺ e0. If line 2 sets nonlocker[h] ← w, then w performs a
serial access, which must be prior to the most recent serial access by e0. By
Lemma 6, we once again obtain the contradiction that w ≺ e0.

(⇐) We now show that if a violation of the umbrella discipline exists, then
BRELLY detects a violation. If a violation exists, then there must be an
unprotected umbrella of accesses to a location l. Of these unprotected
umbrellas, let T be a maximal one in the sense that T is not a subtree of
another umbrella of accesses to l, and let p be the P-node that roots T. The
proof focuses on the values of accessor[l] and locks[l] just after p's left
subtree executes.

We first show that at this point, accessor[l] is a left-descendant of p. Assume
for the purpose of contradiction that accessor[l] is not a left-descendant of p
(and is therefore not a descendant of p at all), and let
p' = LCA(accessor[l], p). We know that p' must be a P-node, since otherwise
accessor[l] would have been overwritten in line 5 by the first access in p's
left subtree. But then p' roots an umbrella that is a proper superset of T,
contradicting the maximality of T.

Since accessor[l] belongs to p's left subtree, no access in p's right subtree
overwrites locks[l], as they are all logically in parallel with accessor[l].
Therefore, the accesses in p's right subtree may only kill locks in locks[l]. It
suffices to show that by the time all accesses in p's right subtree execute, all
locks in locks[l] (if any) have been killed, thus causing a violation to be
declared. Let h be some lock in locks*[l] just after the left subtree of p
completes.

Since T is unprotected, an access to l unprotected by h must exist in at least
one of p's two subtrees. If some access to l is not protected by h in p's right
subtree, then h is killed in line 9. Otherwise, let e_left be the most recently
executed thread in p's left subtree that performs an access to l not protected
by h. Let e' be the thread in accessor[l] just after e_left executes, and let
e_right be the first access to l in the right subtree of p. We now show that in
each of the following cases, we have nonlocker[h] || e_right when e_right
executes, and thus h is killed in line 13.

Case 1: Thread e_left is a serial access. Just after e_left executes, we have
h ∉ locks[l] (by the choice of e_left) and accessor[l] = e_left. Therefore, when
h is later placed in locks[l] in line 2, nonlocker[h] is set to e_left. Thus, we
have nonlocker[h] = e_left || e_right.

Case 2: Thread e_left is a parallel access and h ∈ locks[l] just before e_left
executes. Just after e' executes, we have h ∈ locks[l] and alive[h] = TRUE,
since h ∈ locks[l] when e_left executes and all accesses to l between e' and
e_left are parallel and do not place locks into locks[l]. By pseudotransitivity
(Lemma 2), e' || e_left and e_left || e_right implies e' || e_right. Note that
e' must be a descendant of p, since if it were not, T would not be a maximal
umbrella of accesses to l. Let e'' be the most recently executed thread before
or equal to e_left that kills h. In doing so, e'' sets nonlocker[h] ← e'' in
line 10. Now, since both e' and e_left belong to p's left subtree and e''
follows e' in the execution order and comes before or is equal to e_left, it
must be that e'' also belongs to p's left subtree. Consequently, we have
nonlocker[h] = e'' || e_right.

Case 3: Thread e_left is a parallel access and h ∉ locks[l] just before e_left
executes. When h is later added to locks[l], its nonlocker[h] is set to e'. As
above, by pseudotransitivity, e' || e_left and e_left || e_right implies
nonlocker[h] = e' || e_right.

In each of these cases, nonlocker[h] || e_right still holds when e_right
executes, since e_left, by assumption, is the most recent thread to access l
without h in p's left subtree. Thus, h is killed in line 13 when e_right
executes.


Analysis of Brelly 


Theorem 9 On a Cilk program that executes serially in time T, uses V
shared-memory locations, and holds at most k locks simultaneously, the BRELLY
algorithm runs in O(kT α(V,V)) time and O(kV) space.

Proof: The total space is dominated by the locks shadow space. For any location
l, the BRELLY algorithm stores at most k locks in locks[l] at any time, since
locks are placed in locks[l] only in line 2 and |H| ≤ k. Hence, the total space
is O(kV).

Each loop in Figure 3-2 takes O(k) time if lock sets are kept in sorted order,
excluding the checking of nonlocker[h] || e in line 12, which dominates the
asymptotic running time of the algorithm. The total number of times
nonlocker[h] || e is checked over the course of the program is at most kT,
requiring O(kT α(V,V)) time.
Chapter 4


Related Work 


In this chapter, we compare the ALL-SETS and BRELLY algorithms to previous race 
detection algorithms in the literature. Although they may not have made it explicit, 
these past algorithms also detect dag races, and not data races. (Further discussion 
of the difference between the two kinds of races is given in Chapters 5 and 7.) We 
focus on dynamic, on-the-fly debugging tools. On-the-fly tools detect races as the 
program executes and are generally more efficient than postmortem tools, which run 
detection algorithms on program execution traces. 

Figure 4-1 summarizes the comparison of ALL-SETS and BRELLY with previous 
work. (This figure is the figure from Chapter 1 with tighter bounds for ALL-SETS 
and Dinning and Schonberg’s Lock Covers.) The ALL-SETS algorithm is the fastest 
algorithm that precisely detects dag races in programs that use locks. The BRELLY 
algorithm is the fastest algorithm that detects locking discipline violations in fully 
series-parallel programs. 

The original work in this area was the English-Hebrew labeling method proposed
by Nudler and Rudolph [36]. Their model assumes nested parallelism similar to
Cilk's spawn/sync, but does not address programs that use locks.¹ In order to
determine the logical relation between threads, each thread is given an English
label and a Hebrew label. The ith child of a thread is labeled with i appended
to its parent's label, where

¹ Nudler and Rudolph do discuss handling explicit synchronization operations
between parallel threads, but we do not discuss that portion of the algorithm
here.
                              Handles  Handles
Algorithm                     locks    series-    Detects      Time per           Total space
                                       parallel                memory access
                                       dags
----------------------------  -------  ---------  -----------  -----------------  --------------------
English-Hebrew labeling [36]  NO       YES        Dag races    O(pt)              O(Vt + min(bp, Vtp))
Task Recycling [10]           NO       YES        Dag races    O(t)               O(t^2 + Vt)
Offset-span Labeling [28]     NO       YES        Dag races    O(p)               O(V + min(bp, Vp))
SP-BAGS [14]                  NO       YES        Dag races    [illegible]        [illegible]
Lock Covers [11]              YES      YES        Dag races    [illegible]        [illegible]
Eraser [39]                   YES      NO         Eraser       [illegible]        [illegible]
                                                  discipline
                                                  violations
ALL-SETS                      YES      YES        Dag races    O(L(k + α(V,V)))   O(kLV)
BRELLY                        YES      YES        Umbrella     O(k α(V,V))        O(kV)
                                                  discipline
                                                  violations

p = maximum depth of nested parallelism
t = maximum number of logically concurrent threads
V = number of shared memory locations used
b = total number of threads in the computation
k = maximum number of locks held simultaneously
L = maximum number of distinct lock sets used to access a location


Figure 4-1: Comparison of dag-race detection algorithms. This figure gives tighter bounds 
for ALL-SETS and Lock Covers than those given in Figure 1-6. 


i is counted left to right for the English label and right to left for the
Hebrew label. If each label of e1 is less than the corresponding label of e2,
then e1 ≺ e2.


The length of the labels is O(p), where p is the maximum depth of nested
parallelism. In addition to keeping the labels, the algorithm must keep an
“access history” for each memory location. An access history is a list
containing information on which threads have accessed the location. (The access
histories are essentially designed to keep the same information that the
Nondeterminator-2 keeps in its shadow spaces.) In this case, the access history
is a list of pointers to labels. The access history for each location may grow
as large as the maximum number of logically concurrent threads t. The reason for
this potentially large growth is that all concurrent threads that read the
location must be noted in the access history. The parallel relation || is not
transitive. So if two threads e1 || e2 both read a memory location, they must
both be noted in the access history, because a write to that location by another
thread e3 must be checked with both e1 and e2 for dag races.


If the program uses a total of V shared memory locations, then the algorithm keeps 
O(Vt) pointers in access histories. With reference counting garbage collection, the 
storage for the labels can be bounded by O(Vtp). This storage can also be bounded 
by O(bp), where b is the total number of threads in the execution. Thus, the total 
amount of space used by the English-Hebrew labeling scheme is O(Vt+min(bp, Vtp)). 
At each memory access, the algorithm does O(t) comparisons of size O(p) labels, for 
a time of O(pt). Essentially, then, the algorithm slows down ordinary execution by a 
factor of O(pt).²

The task recycling algorithm, due to Dinning and Schonberg [10], records more
information in order to reduce the time to check if two threads are concurrent. Like
English-Hebrew labeling, the algorithm does not address programs with locks. The
algorithm uses at most t task identifiers, which it assigns to all the threads. To
distinguish between multiple threads with the same task id, each thread is given
a unique version number for its task. In addition, each currently executing thread e
maintains a parent vector of size t. The ith entry in this vector denotes the largest
version number for task i that serially precedes e. Thus, determining the logical
relationship between threads requires only a constant-time operation: a vector lookup
and a version number comparison.
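The constant-time check can be sketched as below. The `Thread` record and the functions `precedes` and `concurrent` are hypothetical names; the sketch assumes version numbers start at 1, with 0 in a parent vector meaning that no version of that task precedes the thread. How the vectors are maintained at forks and joins is not shown.

```python
from dataclasses import dataclass, field


@dataclass
class Thread:
    task: int          # task identifier, reused by later threads
    version: int       # unique among the threads that used this task
    # parents[i] = largest version of task i that serially precedes this thread
    parents: list = field(default_factory=list)


def precedes(e1, e2):
    """One vector lookup and one comparison."""
    return e1.version <= e2.parents[e1.task]


def concurrent(e1, e2):
    return not precedes(e1, e2) and not precedes(e2, e1)


a = Thread(task=0, version=1, parents=[0, 0])
b = Thread(task=1, version=1, parents=[1, 0])   # runs after a
c = Thread(task=0, version=2, parents=[1, 0])   # reuses task 0 after a
```

Here `a` precedes both `b` and `c`, while `b` and `c` are concurrent because neither appears in the other's parent vector.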

The task recycling algorithm, however, must still keep the O(t) size access history
for each memory location. Thus, at each access, the algorithm performs O(t) operations,
each taking O(1) time, for a program slowdown of O(t). Up to t threads may
require size t parent vectors, so the parent vectors require O(t²) space. The storage
for parent vectors together with the space for access histories yields O(t² + Vt) total
space.

²To be fully precise, we should also mention the O(p) time to create and join threads. This
term can be ignored when compared with the O(pt) operation at each memory access, and anyway
memory accesses occur much more frequently than thread creation/termination in most programs.

Mellor-Crummey’s offset-span labeling approach [28] reduces the size of access 
histories by keeping ids only for “lowest leftmost” and “lowest rightmost” readers. In 
this way, all dag races can be found, because a write that races with any read also 
races with one of the reads in the access history. The space for the access histories is 
therefore reduced to O(V). To determine concurrency, each thread is assigned a label 
which consists of a sequence of offset-span pairs. The ith child thread is labeled by
appending the pair [i, s] to its parent’s label, where s is the total number of children
being created, the span, and i is called the offset. The mechanism for thread joining
is complicated, but the idea is that one of the pairs [o, s] is replaced with [o + s, s].
We can check if thread e₁ precedes e₂ by checking if the threads’ labels contain pairs
[o₁, s] and [o₂, s], respectively, such that o₁ mod s = o₂ mod s.

The maximum size of labels is once again O(p), so the space for the labels is 
bounded by O(Vp) (assuming garbage collection). This space is also bounded by 
O(bp), so the total space of the algorithm is O(V + min(bp, Vp)). Assuming that mod
is a constant time operation, the time to check each memory access is just O(p), the 
time to compare two labels. 

The SP-BAGS algorithm [14], as we have seen, uses a variation on Tarjan’s least
common ancestor algorithm to find the logical relationships between threads. This
algorithm runs in O(α(V,V)) amortized time per memory access, and its disjoint set
structure requires O(V) space when reference counting garbage collection is used.
One key idea of SP-BAGS is that by running the program in a known serial order, the 
size of the access history can be reduced, because the relation || is pseudotransitive 
(Lemma 2). SP-BAGS thus keeps only one reader per access history and so requires 
only O(V) total space. 

None of the algorithms discussed so far properly handles programs
with locks. That is, they report parallel updates as dag races even if those updates
hold a lock in common. Dinning and Schonberg give a way to extend their previous
work to correctly identify dag races in programs with locks [11]. The idea is to keep,
for every thread id in an access history, the set of locks that were held at the time
of that access. Accesses that use distinct lock sets must all be recorded in the access
history. Dinning and Schonberg’s Lock Covers algorithm maintains access histories
of size O(tkL), where k is the maximum number of locks held simultaneously, and
L is the maximum number of distinct lock sets used to access any particular
location.

Dinning and Schonberg do not specify how this algorithm should determine concurrency.
If we assume they use their earlier task recycling algorithm, then concurrency
can be determined in O(1) time and O(t²) space. In order to detect dag races,
the algorithm must also intersect sets of locks; such an intersection requires O(k)
time (assuming the sets are sorted in some way). The algorithm therefore requires
O(tk²L) time per memory access and O(t² + tkLV) total space.
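A rough sketch of the per-access check follows. It is not Dinning and Schonberg's implementation: the names `share_lock` and `dag_races` are hypothetical, lock sets are assumed to be stored as sorted lists, each history entry is assumed to be a `(thread, locks, is_write)` triple, and the concurrency test is passed in as a caller-supplied predicate.

```python
def share_lock(locks_a, locks_b):
    """Two-pointer scan over two sorted lock lists.

    Runs in time linear in the list lengths, which is O(k) when at most
    k locks are held simultaneously.
    """
    i = j = 0
    while i < len(locks_a) and j < len(locks_b):
        if locks_a[i] == locks_b[j]:
            return True
        if locks_a[i] < locks_b[j]:
            i += 1
        else:
            j += 1
    return False


def dag_races(new, history, concurrent):
    """Return the history entries that form a dag race with the new access."""
    thread, locks, is_write = new
    return [
        old for old in history
        if (is_write or old[2])              # at least one access is a write
        and concurrent(old[0], thread)       # the threads are logically parallel
        and not share_lock(old[1], locks)    # no lock is held in common
    ]
```

The check visits every entry of the access history, which is why the history size appears as a factor in the time per memory access.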

The ALL-SETS algorithm, then, can be seen as a variation of Lock Covers that 
achieves better asymptotic performance by using the same ideas as the original 
Nondeterminator—the disjoint set structure and the pseudotransitivity of ||. As we 
have seen, the algorithm uses O(kLV) space and O(L(k + α(V,V))) amortized time
per memory access. 

Savage et al. [39] originally proposed the idea of using a locking discipline for
race-detection purposes. Their discipline requires that every access to a shared
variable be protected by a single lock. Their model does not allow for nested parallelism
or barriers. Rather, they simply assume that all accesses are in parallel with each
other.³ At each access, the set of locks that is allowed to protect the location being
accessed is intersected with the currently held set of locks. This operation takes O(k)
time and requires the access history to hold a lock set of size O(k). So Eraser takes
O(k) time per memory access, and requires a total of O(kV) space.
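The intersection step can be sketched as follows. This is a simplified illustration that ignores Eraser's initialization phase; the function name `check_access` and the dictionary of candidate lock sets are assumptions made for the example, and Python sets stand in for whatever sorted representation gives the O(k) bound.

```python
def check_access(candidates, location, held):
    """Refine the candidate lock set of a location at one access.

    candidates maps each location to the set of locks that may still
    protect it. Returns False when no candidate lock remains, which is a
    violation of the locking discipline.
    """
    held = frozenset(held)
    if location not in candidates:
        # First access: every lock currently held is a candidate.
        candidates[location] = held
    else:
        candidates[location] = candidates[location] & held
    return bool(candidates[location])


candidates = {}
ok_first = check_access(candidates, "x", {"A", "B"})   # candidates: A and B
ok_second = check_access(candidates, "x", {"A"})       # candidates: A only
ok_third = check_access(candidates, "x", {"B"})        # no candidate is left
```

Only one lock set per location is stored, which is the source of the O(kV) space bound quoted above.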


The BRELLY algorithm can therefore be seen as an application of the idea of
locking disciplines to a general series-parallel environment. Its O(k α(V,V)) amortized
time per memory access and O(kV) space usage are almost equivalent to Eraser’s
asymptotic bounds.

³Actually, Eraser allows for an initial serial “initialization phase” in which a variable may be
written without being protected by a lock. This phase is assumed to end as soon as an access in
a different thread occurs. This access itself may constitute a race, but Eraser does not report this
possibility.

Others have proposed detecting races by “piggybacking” on the machine’s cache 
coherence protocol [30, 37]. In principle, such piggybacking is only useful in detecting 
data races that actually occur in an execution. That is, the cache coherence protocol 
can detect when threads that actually run in parallel access the same location. To 
detect races based on logical relationships, these approaches must do extra work 
similar to the other algorithms we have seen. 

Comparing times per memory access is slightly unfair, because SP-BAGS, ALL-SETS,
and BRELLY all run in series, whereas the other algorithms run in parallel. The
other algorithms, however, need to add extra locking in order to synchronize
updates to the access histories. This synchronization adds extra work to the program
and may reduce its parallelism as well. Additionally, running the debugger in parallel
means that if the input program is nondeterministic, then the debugger itself will
be nondeterministic. This behavior is probably not desirable when debugging, as
programmers may need to run the debugger several times if they plan on fixing race
bugs one-by-one.

Part II


Theory of Nondeterminism 


Chapter 5 


Nondeterminism 


In this chapter, we give a model for defining nondeterminism and use that model to 
define a hierarchy of forms of nondeterminism. The model allows programmers to 
define the specific form of nondeterminism that they care about for any particular 
program. The model is used in Chapters 7 and 8 to precisely explain the guarantees 


of determinacy that the Nondeterminator-2 provides. 


A model for Cilk execution 


In order to describe nondeterministic program executions, we first give a formal mul- 
tithreaded machine model that describes the actual execution of a Cilk program. 
In particular, we explain how a program execution can be viewed as a sequence of 
“instruction instantiations.” 

We can view the abstract execution machine for a multithreaded language as a 
(sequentially consistent [26]) shared memory together with a collection of inter- 
preters. (See [4, 9, 20] for examples of multithreaded implementations similar to 
this model.) Each interpreter contains private state which only it can modify. Part 
of its private state is a program counter, which points to an instruction within the 
code for the program. (We assume that the code is read-only, and so where it resides 
is immaterial.) The state of the multithreaded machine can be viewed as a private
state vector, consisting of the private interpreter states, together with a shared
state vector, consisting of the shared memory. Both state vectors may grow and
shrink during execution, since new interpreters are created and destroyed, and shared
memory can be allocated and freed.

Although a multithreaded execution may proceed in parallel, we consider a seri- 
alization of the execution in which only one interpreter executes at a time, but the 
instructions of the different interpreters may be interleaved.¹ The initial state of the
machine consists of a single interpreter whose program counter points to the first 
instruction of the program. At each step, a nondeterministic choice among the cur- 
rent nonblocked interpreters is made, and the instruction pointed to by its program 
counter is executed. 

When an instruction is executed by an interpreter, it maps the current state of
the multithreaded machine to a new state.² There are eight types of instructions:³

ALU: Modifies only the state of the interpreter that executes it.

READ: Loads a value from shared memory into the local interpreter state.

WRITE: Stores a value into shared memory from the local interpreter state.

LOCK: Acquires a specified lock (special location in shared memory). Cannot be
executed unless no other interpreter holds the lock.

UNLOCK: Releases a specified lock.

SPAWN: Creates a new interpreter with a specified program counter and local state.
The new interpreter is a child of the original interpreter.

SYNC: No-op. Cannot be executed unless the interpreter has no children.

RETURN: Syncs, then destroys the interpreter.


¹The fact that any parallel execution can be simulated in this fashion is a consequence of our choice
of sequential consistency as the memory model. The model also assumes that single instructions are
guaranteed to be atomic by the hardware, which is the case in most modern machine architectures.

²An instruction can formally be said to be a state to state mapping. This definition means that
an instruction itself is always deterministic; we do not discuss random number generators or other
forms of “serial nondeterminism.”

³Two additional instructions, MALLOC and FREE, are discussed in Chapter 9.

In addition to performing one of these actions, executing an instruction typically
causes an interpreter to modify its program counter to point to the next instruction
in the program. Only an ALU instruction is allowed to modify the program counter
to become anything other than the next instruction specified by the program.⁴ An
interpreter whose next instruction cannot be executed is said to be blocked. If all
interpreters are blocked, the machine is deadlocked, and the execution is said to be
a deadlock execution.
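The scheduling side of this machine model can be sketched as follows. This is a toy illustration under several assumptions: programs are lists of `(opcode, operand)` pairs that always end with RETURN, the operand of SPAWN is the child's code, memory contents and branching are ignored, and the names `Interpreter`, `is_blocked`, and `run` are hypothetical.

```python
import random


class Interpreter:
    def __init__(self, name, code, parent=None):
        self.name = name
        self.code = code           # list of (opcode, operand) pairs
        self.pc = 0                # private program counter
        self.parent = parent
        self.spawned = 0           # children created so far
        self.live_children = 0     # children not yet destroyed


def is_blocked(interp, lock_owner):
    """An interpreter is blocked if its next instruction cannot execute."""
    op, arg = interp.code[interp.pc]
    if op == "LOCK":
        return lock_owner.get(arg) not in (None, interp)
    if op in ("SYNC", "RETURN"):   # RETURN syncs before destroying
        return interp.live_children > 0
    return False


def run(main_code, choose=random.choice):
    """Serialize one execution and return "done" or "deadlock"."""
    live = [Interpreter("Interpreter", main_code)]
    lock_owner = {}
    while live:
        runnable = [i for i in live if not is_blocked(i, lock_owner)]
        if not runnable:
            return "deadlock"      # every remaining interpreter is blocked
        interp = choose(runnable)  # the nondeterministic scheduling choice
        op, arg = interp.code[interp.pc]
        interp.pc += 1
        if op == "LOCK":
            lock_owner[arg] = interp
        elif op == "UNLOCK":
            lock_owner[arg] = None
        elif op == "SPAWN":
            interp.spawned += 1
            interp.live_children += 1
            child_name = interp.name + str(interp.spawned)
            live.append(Interpreter(child_name, arg, parent=interp))
        elif op == "RETURN":
            live.remove(interp)
            if interp.parent is not None:
                interp.parent.live_children -= 1
        # ALU, READ, WRITE, and SYNC do not affect scheduling in this sketch.
    return "done"
```

For example, a child that executes LOCK and then RETURN without an UNLOCK leaves its parent blocked forever at a later LOCK on the same lock, so `run` reports a deadlock execution regardless of the scheduling choices.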

Additionally, during the execution of a program, we can assign a unique inter- 
preter name to each interpreter, in the following manner. The first interpreter 
is named by some fixed string, say “Interpreter.” At each spawn, an interpreter 
names the newly created child interpreter by appending the number of children it 
has spawned to its own name. For example, the interpreter that is the third child 
spawned from the fourth child of the initial interpreter is named “Interpreter43.” 
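As a small illustration of the naming rule, the sketch below builds a name from the sequence of child positions leading down from the initial interpreter. The function `interpreter_name` is a hypothetical helper. The sketch concatenates digits exactly as in the example above, so it is only unambiguous while every interpreter spawns fewer than ten children; the text does not say how larger counts are written.

```python
def interpreter_name(path, root="Interpreter"):
    """Name an interpreter from its spawn path.

    path lists, from the initial interpreter downward, which child
    (1-based) was spawned at each level.
    """
    return root + "".join(str(i) for i in path)


# The third child spawned from the fourth child of the initial interpreter.
example = interpreter_name([4, 3])
```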

When an instruction executes in a run of a program, it has a dynamic effect on
the state of the machine. To formalize the effect of an instruction execution, we
define an instantiation of an instruction to be a 3-tuple consisting of an instruction
I, the shared memory location l on which I operates (if any), and the name of the
interpreter that executes I. (Technically, this 3-tuple should probably be called a
partial instantiation, as it does not specify all the values involved in the execution of
I, but we refer to it as an instantiation for convenience.) By examining the eight types
of machine instructions, we can see that when an interpreter executes an instruction,
the instantiation of that instruction is entirely determined by the private state of the
interpreter.

We therefore think of an execution of a program as the sequence of instantiations
resulting from running the machine model on the program. This view of
executions is precisely the reason we have defined the concept of an instantiation:
to make it explicit which memory locations are touched by the instructions of an
execution. This formulation makes it easier to define nondeterminism.


⁴In other words, the program may not branch on a value in shared memory. It must first read
that value into private memory, and then issue a branching ALU instruction.

A model of nondeterminism


This section provides a framework for defining forms of nondeterminism, and defines 
a few common nondeterminacy classes. In particular, we give a formal definition of 


what it means for a program to have a data race. 


From the English definition of the word, a program might be called “nondeter- 
ministic” if it produces differing behaviors on different executions. Many forms of 
nondeterminism are possible, however. Nondeterminism may be intended by the pro- 
gram, or it may be an accidental artifact of parallel execution. A program might 
behave nondeterministically “in the middle” of execution but produce a deterministic 


answer. 


Rather than using the term “nondeterministic” ambiguously, it is desirable to dis- 
tinguish between its many forms. Emrath and Padua [13] call a program determi- 
nate if it “always leads to the same results,” or nondeterminate otherwise. They 
further divide these categories into subcategories. They call a program internally 
determinate if the sequence of instructions each thread executes, along with the val- 
ues of the variables used by each instruction, is determinate. If a program’s output is 
determinate, but the program is not internally determinate, Emrath and Padua say 
it is externally determinate. A nondeterminate program is called associatively 
nondeterminate if the nondeterminate output is due only to lack of associativity 


of floating-point operations, or completely nondeterminate otherwise. 


Netzer and Miller [35] use a formal model of program behavior based on Lamport’s 
theory of concurrent systems [27] to define nondeterminism. They are specifically 
concerned with defining race conditions. They define a general race to occur in 
a program when two conflicting memory accesses are not forced to occur in a fixed 
order. The idea is that a general race is a bug in a program that is intended to be 
deterministic. A data race, on the other hand, is a bug in a program that is intended
to be nondeterministic, and represents only nonatomic execution of critical sections.


Netzer and Miller further distinguish both general and data races as being either
“feasible” or “apparent.” A feasible race is one which could occur in an actual
execution of the program. An apparent race is a race that appears when only the
explicit synchronization of the program is considered. Netzer and Miller say that
apparent races are approximations to feasible races, and that most race detection
algorithms implicitly detect apparent races.

We present our own formal model for defining types of nondeterminism. Our goal 
is twofold. First, we would like to be able to define a framework in which any form 
of nondeterminism can be defined. Rather than defining the particular forms that 
we think are important, our formalization makes it possible to define an unlimited 
number of types of nondeterminism. 

Second, our formalism allows us to explain precisely what our proposed race
detection algorithms do. We discuss program executions at the instruction level, so 
that the model is easy to understand. An instruction has a precisely defined meaning, 
and so may be easier to reason about than a model based on “events.” 

We observe that it does not really make sense to speak of a single execution 
as being nondeterministic, because nondeterminism implies that multiple executions 
produce varying results. Therefore, we define a set of executions as being deterministic 
or nondeterministic. Initially, the set of executions we consider are the executions that 
the program can generate according to the machine model. Later in this thesis, we 
consider other sets of executions as well. 

To define a form of nondeterminism, we define an equivalence relation ~ on executions.
Thus, a set of executions $\mathcal{X}$ is nondeterministic under ~ if there exist
executions $X_1, X_2 \in \mathcal{X}$ such that $X_1 \not\sim X_2$. Similarly, $\mathcal{X}$ is deterministic under
~ if $X_1 \sim X_2$ for all $X_1, X_2 \in \mathcal{X}$.
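The definition is parameterized by the equivalence relation, which the following sketch makes explicit. The function `is_deterministic` and the two toy relations `identical` and `same_multiset` are hypothetical and chosen only to show that the same set of executions can be deterministic under one relation and nondeterministic under another.

```python
from collections import Counter
from itertools import combinations


def is_deterministic(executions, equivalent):
    """True if every pair of executions is related by `equivalent`."""
    return all(equivalent(x, y) for x, y in combinations(executions, 2))


def identical(x, y):
    """Toy relation: an execution is equivalent only to itself."""
    return x == y


def same_multiset(x, y):
    """Toy relation: same instantiations, in any order."""
    return Counter(x) == Counter(y)


# Two executions, each a sequence of (instruction, location, interpreter).
executions = [
    (("WRITE", "a", "Interpreter1"), ("WRITE", "b", "Interpreter2")),
    (("WRITE", "b", "Interpreter2"), ("WRITE", "a", "Interpreter1")),
]
```

With these inputs the set is nondeterministic under `identical` but deterministic under `same_multiset`.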

Using this approach, we can define many forms of nondeterminism. We discuss
several common possibilities here. Rather than explicitly saying “deterministic under
equivalence relation ~,” we often call such programs “~ deterministic.”

As Emrath and Padua point out, a program may be deterministic on one input
but nondeterministic on another. Since we have chosen to define forms of determinacy
on sets of executions, we are implicitly discussing the determinacy of a program for
a given input.

[Figure 5-1 diagram: ovals labeled Final-state determinacy, Read-permute determinacy,
Location determinacy, Serial determinacy, and Data race freedom.]

Figure 5-1: The hierarchy of determinacy classes. Each oval in the diagram represents
the set of programs that satisfy a particular definition of determinacy.


A hierarchy of determinism 


Figure 5-1 shows the hierarchy of determinism (or nondeterminism) that we define. 
This chapter does not formally show the relationships between the different types 
of nondeterminism, but each relationship can either be inferred directly from the 
definition or is shown later in this thesis. 

An execution is serial equivalent only to itself. Therefore, a program is serial
deterministic if it generates only one execution, namely, if it is a serial program.

Recall that an execution X is defined to be a sequence of instantiations, where an
instantiation x is a triple $(I, l, \eta)$ consisting of instruction I, memory location l,
and interpreter $\eta$. For such an instantiation, we define the selectors $\mathcal{I}$, $\mathcal{L}$, and
$\mathcal{N}$ such that $\mathcal{I}(x) = I$, $\mathcal{L}(x) = l$, and $\mathcal{N}(x) = \eta$. Let us define the location
subsequence $X|_l$ of location l on execution X to be the subsequence formed by taking
all $x_i \in X$ such that $\mathcal{L}(x_i) = l$.⁵ We also will use $\pi$ to denote a permutation on a
set of integers.

Two executions $X = x_1 x_2 \ldots x_m$ and $Y = y_1 y_2 \ldots y_n$ are location equivalent if
the following two conditions hold:

1. There exists a permutation $\pi$ such that $x_i = y_{\pi(i)}$ for all $i \in \{1, 2, \ldots, n\}$
   (and hence n = m).

2. For all memory locations l, we have $X|_l = Y|_l$.

In other words, a location deterministic program is allowed to have operations
on different memory locations interleaved, but the operations on each individual memory
location must be serialized in a fixed order.
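The two conditions translate directly into a check on a pair of executions. In the sketch below, an instantiation is assumed to be a tuple `(instruction, location, interpreter)` with `None` as the location of instructions that touch no memory; the function names are hypothetical.

```python
from collections import Counter


def location_subsequence(execution, location):
    """All instantiations on one location, in execution order."""
    return [x for x in execution if x[1] == location]


def location_equivalent(X, Y):
    # Condition 1: Y is a permutation of X.
    if Counter(X) != Counter(Y):
        return False
    # Condition 2: each location sees the same sequence of operations.
    locations = {x[1] for x in X if x[1] is not None}
    return all(
        location_subsequence(X, l) == location_subsequence(Y, l)
        for l in locations
    )


X = [("WRITE", "a", "I1"), ("WRITE", "b", "I2"), ("READ", "a", "I2")]
Y = [("WRITE", "b", "I2"), ("WRITE", "a", "I1"), ("READ", "a", "I2")]
Z = [("READ", "a", "I2"), ("WRITE", "a", "I1"), ("WRITE", "b", "I2")]
```

Here X and Y are location equivalent, since only operations on different locations were interleaved differently, whereas X and Z are not, because the read of `a` moved ahead of the write to `a`.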

We can weaken this definition of determinacy by allowing reads to be permuted.

Two executions $X = x_1 x_2 \ldots x_m$ and $Y = y_1 y_2 \ldots y_n$ are read permute equivalent
if both of the following conditions are true:

1. There exists a permutation $\pi$ such that $x_i = y_{\pi(i)}$ for all $i \in \{1, 2, \ldots, n\}$
   (and hence n = m).

2. For all memory locations l, there exists a permutation $\pi_l$ such that the
   following two conditions hold:

   (a) $x_i \in X|_l$ if and only if $y_{\pi_l(i)} \in Y|_l$.

   (b) If $\mathcal{I}(x_i)$ or $\mathcal{I}(x_j)$ is not a READ instruction for any $x_i, x_j \in X|_l$,
       then $i < j \Rightarrow \pi_l(i) < \pi_l(j)$.

A read permute deterministic program, therefore, is allowed to have reads
of the same memory location permuted around each other, but not around writes to
that location. Read-permute determinism is what is typically meant by just the word
“deterministic.”
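One way to test this relation is sketched below. It rests on an interpretation of the definition: on each location, every non-READ instantiation keeps its position relative to all other operations on that location, so only maximal runs of consecutive READs may be reordered. Instantiations are assumed to be tuples `(instruction, location, interpreter)`, and the function names are hypothetical.

```python
from collections import Counter


def canonical(subsequence):
    """Sort each maximal run of READs so that runs compare as multisets."""
    result, reads = [], []
    for x in subsequence:
        if x[0] == "READ":
            reads.append(x)
        else:
            result.extend(sorted(reads))   # close the current run of reads
            reads = []
            result.append(x)
    result.extend(sorted(reads))
    return result


def read_permute_equivalent(X, Y):
    if Counter(X) != Counter(Y):           # Y must be a permutation of X
        return False
    locations = {x[1] for x in X if x[1] is not None}
    return all(
        canonical([x for x in X if x[1] == l])
        == canonical([y for y in Y if y[1] == l])
        for l in locations
    )


X = [("WRITE", "a", "I"), ("READ", "a", "I1"), ("READ", "a", "I2"), ("WRITE", "a", "I")]
Y = [("WRITE", "a", "I"), ("READ", "a", "I2"), ("READ", "a", "I1"), ("WRITE", "a", "I")]
Z = [("READ", "a", "I1"), ("WRITE", "a", "I"), ("READ", "a", "I2"), ("WRITE", "a", "I")]
```

X and Y differ only in the order of two reads between the same pair of writes, so they are read permute equivalent; in Z a read has moved across a write, so X and Z are not.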


Two executions are final state equivalent if both leave the machine in the same
exact state after completion. Programs that are final state deterministic are also
called determinate.

⁵Given a sequence $X = x_1 x_2 \ldots x_m$, another sequence $Z = z_1 z_2 \ldots z_k$ is a subsequence of X if
there exists a strictly increasing sequence $i_1, i_2, \ldots, i_k$ of indices of X such that for all $j = 1, 2, \ldots, k$,
we have $x_{i_j} = z_j$.

Many more forms of determinacy exist that one might like to define. It might 
be useful to have a concept of “observable determinacy,” meaning that only the 
externally observable state of the machine is determinate. Another possible form of 
determinacy is to allow writes to be permuted when they are part of commutative 
critical sections.⁶ This particular form of determinacy resurfaces in Chapter 8. For
now, we use our framework to define race conditions formally. Race conditions are of 
particular interest because they can be viewed as a “local” form of nondeterminism. 
Such local properties are usually easier to detect than large properties of the entire 
program. 

A data race exists between two executions $X = x_1 x_2 \ldots x_m$ and $Y = y_1 y_2 \ldots y_n$
if there exists an integer i in the range $1 \le i < \min(m, n)$ such that the following four
conditions hold:

1. $x_1 x_2 \ldots x_{i-1} = y_1 y_2 \ldots y_{i-1}$,

2. $\mathcal{L}(x_i) = \mathcal{L}(x_{i+1})$,

3. $x_i = y_{i+1}$ and $x_{i+1} = y_i$,

4. $\mathcal{I}(x_i)$ or $\mathcal{I}(x_{i+1})$ is a WRITE instruction.


A program (with input) has a data race if any two of its executions have a data 
race between them. In other words, the program exhibits a data race when it can run 
a fixed sequence of instructions up to the point of the race, and then execute in either 
order two conflicting instructions. This definition captures the idea of “simultaneous” 


conflicting instructions, in light of the fact that the instructions themselves are atomic. 
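The definition can be sketched as a check on a pair of executions. The sketch assumes the four conditions say, in order: the two executions agree before position i, the instantiations at positions i and i+1 touch the same location, they appear in swapped order in the other execution, and at least one of them is a WRITE. Instantiations are assumed to be tuples `(instruction, location, interpreter)`, and the function names are hypothetical.

```python
def data_race_between(X, Y):
    """True if executions X and Y exhibit a data race between them."""
    for i in range(min(len(X), len(Y)) - 1):       # 0-based position of the pair
        first, second = X[i], X[i + 1]
        if (
            X[:i] == Y[:i]                          # identical prefix
            and first[1] == second[1]               # same memory location
            and first == Y[i + 1] and second == Y[i]  # swapped in Y
            and "WRITE" in (first[0], second[0])    # at least one write
        ):
            return True
    return False


def has_data_race(executions):
    """A program (with input) has a data race if any two executions do."""
    return any(data_race_between(X, Y) for X in executions for Y in executions)


X = [("ALU", None, "I"), ("WRITE", "x", "I1"), ("WRITE", "x", "I2")]
Y = [("ALU", None, "I"), ("WRITE", "x", "I2"), ("WRITE", "x", "I1")]
```

For the two executions above, the check succeeds at the second position: both run the same ALU instruction and then perform the two writes to `x` in opposite orders.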


⁶Determinacy that allows permutation of commuting critical sections is not the same as final-state
determinacy, for programs with noncommuting critical sections may still be final-state deterministic.

Chapter 6


Complexity of Race Detection 


Ideally, we would have an algorithm to detect nondeterminacy for each form of non- 
determinacy defined in Chapter 5, and programmers would use whatever algorithm 
best suited their own programs. In most cases, however, precise detection of nonde- 
terminacy is extremely difficult, if not impossible. Precise detection of data races, 
like all nontrivial properties of programs, is undecidable. Furthermore, in this chap- 
ter we show that even in simplified models, detecting data races is computationally 
intractable. We argue that the Nondeterminator-2’s detection of dag races is a computationally
practical approximation to data-race detection.

Theorem 10 Detection of data races in Cilk programs is undecidable.


Proof: The proof is similar to the standard programming proof of the undecidability 
of the halting problem. Assume there exists a serial decider has_data_race that takes 
as input a Program P (represented as a string). has_data_race returns TRUE if P has 
a data race, or FALSE if not. 

Consider the program in Figure 6-1. The routine Run_code_with_a_race(), if
executed, may exhibit a data race. If we pass the DoOpposite program as an argument
to itself, we obtain a contradiction. For, if has_data_race(DoOpposite)
returns TRUE, then DoOpposite returns without ever having a data race. If
has_data_race(DoOpposite) returns FALSE, then DoOpposite executes
Run_code_with_a_race(), and so has a data race. Therefore, the serial decider
has_data_race cannot exist. ∎

    cilk int DoOpposite(Program P)
    {
        if (has_data_race(P))
        {
            return 0;
        }
        else
        {
            spawn Run_code_with_a_race();
            sync;
            return 0;
        }
    }

Figure 6-1: A program used to contradict the existence of a decider for race detection.


The implication of Theorem 10 is that we cannot detect data races exactly at 
compile time. (We typically do not want to take the risk that our compiler may 
run forever.) The question, then, is whether data races can be detected exactly at 
run-time. Running the program does not suddenly turn an undecidable problem into 
a decidable one. Rather, the program itself may still run forever. If we assume that 
the program halts, however, then we may be able to guarantee that a detection tool 
would halt. 

The first observation about this approach is that when the program runs, there 
may be portions of the code that do not execute at all due to the particular scheduling. 
For that code, we are back to the original problem. We can’t statically find races, so 
we need to run that code as well and assume it terminates. Thus, if we assume that 
every scheduling of the program terminates, we may be able to exactly detect data 
races by running all possible schedulings. This approach requires O(T!) time, where 
T is the ordinary execution time of the program. 
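To give a feel for the brute-force approach, the sketch below enumerates every interleaving of two instruction sequences and collects the final values of one location. It is a toy model with assumed names (`interleavings`, `final_values`): instructions are `(op, location, value)` triples, only WRITE has an effect, and locks, blocking, and spawning are ignored.

```python
def interleavings(a, b):
    """Yield every merge of a and b that keeps each sequence's own order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def final_values(a, b, location):
    """Final contents of `location` over all interleavings of a and b."""
    results = set()
    for schedule in interleavings(a, b):
        memory = {}
        for op, loc, value in schedule:
            if op == "WRITE":
                memory[loc] = value
        results.add(memory.get(location))
    return results


write1 = [("WRITE", "x", 1)]
write2 = [("WRITE", "x", 2)]
outcomes = final_values(write1, write2, "x")
```

Two single-instruction threads already have two schedulings with different outcomes, and the number of schedulings grows combinatorially with the lengths of the sequences, which is why exhaustive enumeration is impractical.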

This bound, while finite, is far too expensive for a practical debugging tool. The
next question, then, is whether we can ignore code that is never executed and just
attempt to detect all data races in the code that gets run at least once. (This idea
itself is not very well defined, but this particular discussion is intended to be informal.)

When critical sections execute, they occur in a particular order, but exactly detect- 
ing data races requires determining whether critical sections “synchronize.” Consider 
the program in Figure 6-2(a). Whether the writes to x in Write1 and Write2 con- 
stitute a data race depends on the behavior of the Unknown1 and Unknown2 routines. 
If Unknown1 and Unknown2 do not affect each other’s control flow, as is the case in 
Figure 6-2(b), then the program has a data race, and the final value of x may be 
either 1 or 2. The code in Figure 6-2(c), however, is also possible. In that code, 
Unknown2 does not complete until Unknown1 runs first. In that case, there is no data 
race, for the assignment x = 2 must always occur after the assignment x = 1. 

In general, Unknown1 and Unknown2 could be arbitrary operations. In order to
detect whether the program in Figure 6-2 has a data race, we must determine whether 
Unknown1 and Unknown2 synchronize each other in some way. This determination can 
be shown to be undecidable in a similar fashion to the earlier undecidability proof. 

One possible simplification is to assume that critical sections always form some 
kind of synchronization operation. We can model this simplification by assuming 
that every critical section is either an increment or a decrement of some “semaphore” 
variable. A semaphore variable may be incremented or decremented, but may never 
become negative. That is, if the semaphore is 0, then a decrement operation must 
wait until the semaphore becomes positive before proceeding. We will refer to this 
requirement as the semaphore constraint. 
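Checking whether one particular ordering of semaphore operations is allowed is simple, as the sketch below shows for a single semaphore. The function name and the encoding of operations as the strings "inc" and "dec" are assumptions made for the example.

```python
def obeys_semaphore_constraint(ops, initial=0):
    """True if the ordering never drives the semaphore below zero.

    ops is a sequence of "inc" and "dec" operations on one semaphore,
    in the order in which they are executed.
    """
    value = initial
    for op in ops:
        if op == "inc":
            value += 1
        elif op == "dec":
            value -= 1
            if value < 0:
                return False   # this decrement would have had to wait
        else:
            raise ValueError("unknown operation: " + repr(op))
    return True
```

An ordering such as `["inc", "dec"]` satisfies the constraint, whereas `["dec", "inc"]` does not, because the decrement would have to wait for the increment. The hard part is deciding which of the many possible reorderings of a whole program satisfy the constraint, not checking a single one.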

In this model, we do not need to discern the behavior of critical sections, as they 
are assumed to be semaphore operations, and so we avoid that undecidable problem. 
We still need to discern which instruction orderings are allowed by the semaphore 
constraint, however. That is, in Figure 6-2(c), the statement x = 2 must always occur 
after x = 1. If, however, there were a third parallel procedure that also incremented 
done, then x = 2 could happen before x = 1, and there would be a data race. 

For a program that runs in time T, an NP-hard graph problem of size T [25] can be
reduced to the problem of discovering which reorderings of the instructions conform
to the semaphore constraint. Thus, even in the simplified case where all critical
sections are known to be semaphore operations, exact detection of data races is still
computationally infeasible.

(a)
    int x;
    int done;
    Cilk_lockvar A;

    cilk int main()
    {
        done = 0;
        spawn Write1();
        spawn Write2();
        sync;
        printf("%d", x);
        return 0;
    }

    cilk void Write1()
    {
        x = 1;
        Cilk_lock(A);
        Unknown1();
        Cilk_unlock(A);
    }

    cilk void Write2()
    {
        Cilk_lock(A);
        Unknown2();
        Cilk_unlock(A);
        x = 2;
    }

(b)
    void Unknown1()
    {
        done++;
    }

    void Unknown2()
    {
        done++;
    }

(c)
    void Unknown1()
    {
        done++;
    }

    void Unknown2()
    {
        while (!done)
        {
            /* allow Unknown1() to acquire A */
            Cilk_unlock(A);
            Cilk_lock(A);
        }
    }

Figure 6-2: The program in (a) may exhibit a data race on x depending on the behavior
of Unknown1 and Unknown2. (b) shows an example of these routines that leads to a race on
x, whereas (c) shows an example that does not.

The Nondeterminator-2, therefore, does essentially the opposite: It assumes that 
critical sections do not synchronize each other in any way. In other words, the 
Nondeterminator-2 assumes that locks are being used only to provide atomicity, 
and not to implement synchronization. Thus, for the program in Figure 6-2(c), the 
Nondeterminator-2 reports a data race when there is none. This chapter has shown,
however, that no computationally practical algorithm can be 100 percent accurate
in its race-detection reporting. The precise meaning of the Nondeterminator’s
race reports is discussed and formalized in the next few chapters.

An alternate assumption also allows computationally feasible race-detection
algorithms. This approach only considers the particular semaphore ordering that is
exhibited in one execution of the program, rather than attempting to discern other
orderings. The advantage of this approach is that it only detects true data races. The
problem, however, is that many data races will be missed when, as is commonly the
case, critical sections do not synchronize.
Chapter 7


The Dag Execution Model 


We have seen that detection of data races is computationally infeasible, but we 
have also seen that the Nondeterminator can efficiently detect dag races. In this 
chapter, we explain precisely why dag races are not the same thing as data races. 
Since the Nondeterminator-2 detects dag races, this chapter details exactly when the 
Nondeterminator-2 reports bugs that are not data races, and when the tool fails to 
report data races. 

When a Cilk program executes, it generates an associated computation dag.¹ The
idea is that a dag generated by a single execution contains information about other
possible executions of the program. By examining the dag, we can glean information
about executions other than the one that was actually run. In other words, the dag
is an attempt to abstract away from a particular scheduling of threads to processors.
Hence, the dag contains “logical” relationships rather than “actual” ones. These
logical relationships, however, only represent the synchronization of the program due
to parallel control constructs, and not any synchronization that may occur due to the
operation of critical sections on shared memory.

¹ Formally, a computation dag can be constructed from an execution as follows. An initial node is
created that can be considered to correspond to the initialization of the first interpreter. Whenever
an interpreter executes an instruction I other than a RETURN, with instantiation x, the interpreter
creates a new vertex x and adds to the dag an edge y → x from its last executed instantiation y to x.
If the instruction is a SPAWN, an additional instantiation z is created (representing the initialization
of the child interpreter), and the edge y → z is added to the dag. If the instruction is a RETURN,
no new vertex is created, but an edge goes from y to the vertex created by the next SYNC of the
interpreter’s parent.
A scheduling X of a dag G is a topological sort of the dag.² A scheduling is
legal if, for any two LOCK statements that acquire the same lock, there is an UNLOCK
of that lock in between them. A dag G' is said to be a prefix of a dag G if, for any
nodes x and y such that x ≺_G y and y ∈ G', we have x ∈ G'. A partial scheduling
of G is a legal scheduling of a prefix of G, and if any partial scheduling of G can
be extended to a scheduling of G, we say that G is deadlock free. Otherwise, G
has at least one deadlock scheduling, which is a partial scheduling that cannot be
extended.
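The two definitions above can be checked mechanically. The following Python sketch is only an illustration: the function names, the node names, and the `ops` table are assumptions made for this example, and none of them come from the Nondeterminator-2. A scheduling is tested as a topological sort, and legality is tested by tracking which locks are currently held.

```python
def is_scheduling(order, nodes, edges):
    """True if `order` is a topological sort of the dag (nodes, edges)."""
    if sorted(order) != sorted(nodes):
        return False
    position = {node: i for i, node in enumerate(order)}
    return all(position[u] < position[v] for u, v in edges)


def is_legal(order, ops):
    """True if no lock is acquired twice without an UNLOCK in between.

    `ops` maps a node to ("LOCK", name) or ("UNLOCK", name); nodes that
    are absent from `ops` are treated as ordinary instructions.
    """
    held = set()
    for node in order:
        kind, lock = ops.get(node, (None, None))
        if kind == "LOCK":
            if lock in held:
                return False
            held.add(lock)
        elif kind == "UNLOCK":
            held.discard(lock)
    return True


# Hypothetical dag: main spawns two threads that each lock and unlock A.
NODES = ["m", "a_lock", "a_unlock", "b_lock", "b_unlock"]
EDGES = [("m", "a_lock"), ("a_lock", "a_unlock"),
         ("m", "b_lock"), ("b_lock", "b_unlock")]
OPS = {"a_lock": ("LOCK", "A"), "a_unlock": ("UNLOCK", "A"),
       "b_lock": ("LOCK", "A"), "b_unlock": ("UNLOCK", "A")}

serial = ["m", "a_lock", "a_unlock", "b_lock", "b_unlock"]
interleaved = ["m", "a_lock", "b_lock", "a_unlock", "b_unlock"]
print(is_scheduling(serial, NODES, EDGES), is_legal(serial, OPS))            # True True
print(is_scheduling(interleaved, NODES, EDGES), is_legal(interleaved, OPS))  # True False
```

The second ordering respects every dag edge, so it is a scheduling, but it acquires A twice in a row and is therefore not legal.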

A legal scheduling of a dag, therefore, is an approximation to an execution of the 
program. When a legal scheduling of the dag corresponds to an actual execution 
of the program as defined by the machine model, we say that the scheduling is a 
feasible scheduling; otherwise, it is an infeasible scheduling. 

It may in fact be the case that a legal scheduling of a dag is not feasible, for two
possible reasons. The first reason is demonstrated by the program and corresponding
dag in Figure 7-1.³ In particular, that dag is generated when foo1 obtains lock A
before foo2. Every scheduling of this dag contains the instantiation x6, even though
it does not occur in every execution. (If foo2 obtains lock A before foo1, then the
y = 3 statement is never executed.) So x0 x1 x8 x9 x10 x11 x3 x4 x5 x6 x7 x2, for example, is
a legal scheduling that is not feasible.

We call this situation the forced program counter anomaly. A scheduling 
of a dag specifies an entire sequence of instantiations. When the machine model 
executes an instantiation, the model also specifies which instruction should next be 
executed by that interpreter. The next instruction executed by that interpreter in 
the dag scheduling, on the other hand, is “forced” to be the next one specified in the 
dag, and so may not match the one chosen by the machine model. 
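The anomaly can be seen by simulating the program of Figure 7-1 under both lock orders. The Python sketch below is an assumption-laden illustration, not the machine model itself: it runs each thread to completion in turn (one representative execution per lock order) and records the instantiation labels of Figure 7-1 as they execute. The function name `run_figure_7_1` is hypothetical.

```python
def run_figure_7_1(first):
    """Run Figure 7-1 with thread `first` ("foo1" or "foo2") acquiring A first.

    Returns (labels of the instantiations executed, final value of y).
    """
    state = {"x": 0, "y": None}
    trace = ["x0", "x1"]                 # x = 0; Cilk_lock_init(A)

    def foo1():
        trace.append("x3")               # Cilk_lock(A)
        state["x"] += 1
        trace.append("x4")               # x++
        trace.append("x5")               # if (x == 1)
        if state["x"] == 1:
            state["y"] = 3
            trace.append("x6")           # y = 3
        trace.append("x7")               # Cilk_unlock(A)

    def foo2():
        trace.append("x8")               # Cilk_lock(A)
        state["x"] += 1
        trace.append("x9")               # x++
        trace.append("x10")              # Cilk_unlock(A)
        state["y"] = 4
        trace.append("x11")              # y = 4

    order = [foo1, foo2] if first == "foo1" else [foo2, foo1]
    for thread in order:
        thread()
    trace.append("x2")                   # printf("%d", y)
    return trace, state["y"]


# The legal dag scheduling from the text, which places foo2 before foo1.
forced = "x0 x1 x8 x9 x10 x11 x3 x4 x5 x6 x7 x2".split()

trace_foo2_first, y = run_figure_7_1("foo2")
print("x6" in forced, "x6" in trace_foo2_first, y)   # True False 4
```

The dag scheduling contains x6 because the dag was recorded with foo1 first, whereas a real execution with foo2 first never reaches the y = 3 statement.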

² A topological sort X of G is a permutation of the nodes of G that satisfies the constraints of
the dag; if x ≺ y in G, then x must occur before y in X.

³ Recall that the nodes of a dag are actually instantiations, not instructions. Since each instruction
is executed only once in this example, we simplify notation by labeling the instructions of the program
in Figure 7-1 with the instantiations x_i they generate. This labeling is a further simplification because
some lines of the program actually each correspond to multiple machine instantiations.

int x;
int y;
Cilk_lockvar A;

cilk int main()
{
x0:     x = 0;
x1:     Cilk_lock_init(A);
        spawn foo1();
        spawn foo2();
        sync;
x2:     printf("%d", y);
        return 0;
}

cilk void foo1()
{
x3:     Cilk_lock(A);
x4:     x++;
x5:     if (x == 1)
x6:         y = 3;
x7:     Cilk_unlock(A);
}

cilk void foo2()
{
x8:     Cilk_lock(A);
x9:     x++;
x10:    Cilk_unlock(A);
x11:    y = 4;
}

Figure 7-1: A program that generates a dag that exhibits the forced program counter
anomaly. The dag shown here is generated when foo1 acquires lock A before foo2. There
is a dag race between the highlighted instantiations, but the program has no data race.

The other reason that legal schedulings may not be feasible is the forced memory
location anomaly. An instantiation contains a shared memory location. In a dag
scheduling, the sequence of shared memory locations that are read is fixed. This
sequence may not match the memory locations that would be read, however, if the
machine model were to execute the same sequence of instructions. For example, an
instruction might be “read into register 1 the contents of memory that is at the
address contained in register 2.” In a dag scheduling, the memory location read
by this instruction’s instantiation is fixed, and may not correspond to the location
specified by register 2.


int x[2];
int *y;
int z;
Cilk_lockvar A;

cilk int main()
{
    x[0] = 0;
    x[1] = 1;
    y = x;
    Cilk_lock_init(A);
    spawn bar1();
    spawn bar2();
    sync;
    printf("%d", z);
    return 0;
}

cilk void bar1()
{
    Cilk_lock(A);
    z = *y;
    Cilk_unlock(A);
}

cilk void bar2()
{
    Cilk_lock(A);
    y++;
    Cilk_unlock(A);
}

Figure 7-2: A program that exhibits the forced memory location anomaly.

Figure 7-2 shows an example of a program that exhibits the forced memory location
anomaly. The memory location that is read in the statement z = *y depends on
whether bar1 or bar2 obtains lock A first. Any dag for this program, however, has a
fixed memory location in the instantiation for the read of *y.

Although dag schedulings do not always correspond to machine executions, we can 
still consider them as executions of a dag execution machine. The dag execution 
machine behaves similarly to the ordinary Cilk execution machine, but the program 
counter of each interpreter is always set to point to the next instruction in the dag, 
and the memory locations read are those specified in the instantiations, rather than by 
the instructions. When viewed as a set of dag execution machine executions, the legal 
schedulings of a dag form either a deterministic or nondeterministic set, according to 
the definitions in Chapter 5. In particular, a dag has a data race if two of its legal 
schedulings have a data race between them. 

int max;
int x;
Cilk_lockvar A;

cilk int main()
{
x0:     max = 0;
x1:     Cilk_lock_init(A);
        spawn GetMax1(7);
        spawn GetMax2(3);
        sync;
x2:     printf("%d", x);
        return 0;
}

cilk void GetMax1(int y)
{
x3:     x = max;
x4:     Cilk_lock(A);
x5:     if (y > max)
x6:         max = y;
x7:     Cilk_unlock(A);
}

cilk void GetMax2(int y)
{
x8:     Cilk_lock(A);
x9:     if (y > max)
x10:        max = y;
x11:    Cilk_unlock(A);
}

Figure 7-3: A program with a data race (on variable max) that may not appear as a dag
race due to the forced program counter anomaly. The dag shown, generated when GetMax1
acquires A before GetMax2, does not have a dag race.

By definition, a dag race exists on a computation dag if two logically parallel
threads access the same memory location while holding no locks in common, and at
least one of the threads writes the location. This definition of a dag race is equivalent
to the definition of a data race on the set of legal schedulings of a dag. For, if
two parallel threads hold no locks in common, then we can always construct a legal
scheduling of the dag by scheduling all of the predecessors of the threads, followed by
the threads themselves in either order.
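The definition of a dag race translates directly into a brute-force check over pairs of accesses. The Python sketch below is a quadratic illustration of the definition only; it is neither ALL-SETS nor BRELLY, and the edge list and access table for Figure 7-1 are a simplified encoding assumed for this example (spawn and sync edges are collapsed onto the labeled instantiations).

```python
from itertools import combinations


def reachable(edges, src, dst):
    """True if the dag has a directed path from src to dst."""
    successors = {}
    for u, v in edges:
        successors.setdefault(u, []).append(v)
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(successors.get(node, []))
    return False


def dag_races(edges, accesses):
    """Return pairs of nodes that race.

    `accesses` is a list of (node, location, is_write, locks_held).
    """
    races = []
    for (na, la, wa, ha), (nb, lb, wb, hb) in combinations(accesses, 2):
        parallel = (not reachable(edges, na, nb)
                    and not reachable(edges, nb, na))
        if parallel and la == lb and (wa or wb) and not (ha & hb):
            races.append((na, nb))
    return races


# Simplified dag of Figure 7-1, recorded with foo1 acquiring A first.
FIG_7_1_EDGES = [("x0", "x1"),
                 ("x1", "x3"), ("x3", "x4"), ("x4", "x5"),
                 ("x5", "x6"), ("x6", "x7"), ("x7", "x2"),
                 ("x1", "x8"), ("x8", "x9"), ("x9", "x10"),
                 ("x10", "x11"), ("x11", "x2")]
A = frozenset(["A"])
NONE = frozenset()
FIG_7_1_ACCESSES = [("x0", "x", True, NONE),
                    ("x4", "x", True, A),
                    ("x5", "x", False, A),
                    ("x6", "y", True, A),
                    ("x9", "x", True, A),
                    ("x11", "y", True, NONE),
                    ("x2", "y", False, NONE)]

print(dag_races(FIG_7_1_EDGES, FIG_7_1_ACCESSES))   # [('x6', 'x11')]
```

The updates to x in the two threads are parallel but share lock A, so they are not reported; the writes to y at x6 and x11 hold no lock in common and are reported.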

Since dag executions are not always machine executions, it is not surprising that 
dag races do not always correspond to data races in the program. Figure 7-1 shows 
a program that does not exhibit a data race. Indeed, the final value of y is always 4. 
The dag in Figure 7-1, however, exhibits a dag race on y, as the two writes to y are 
logically in parallel, and do not hold any locks in common. 

Additionally, there may be data races in the program that do not appear as dag
races. Such “missing races” once again may be due to the forced memory location or
forced program counter anomalies. Figure 7-3 shows an example of the latter causing
a data race to be missing from the dag. The program takes the maximum of two
numbers in parallel, but the writes to the max variable depend on the order in which
the critical sections are executed. The dag in Figure 7-3, for example, is generated
by an execution in which GetMax1 obtains lock A before GetMax2. In that dag, the
potential write of max by GetMax2 does not appear. The dag has no dag races, but
there is a data race between the write of max in GetMax2 and the read of max done in
the x = max; statement. The final value of x in this program may be either 0 or 3.
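The claim about the final value of x can be confirmed by enumerating every lock-respecting interleaving of the two threads of Figure 7-3. The following Python sketch is a small model built for this purpose, with assumed names throughout; it treats each `if (y > max) max = y;` as a single step, which does not change the outcome here because the unlocked read `x = max` sees the same value whether it falls before the test or between the test and the write.

```python
def figure_7_3_outcomes():
    """Return the set of (final x, final max) over all legal interleavings."""

    def read_max(state):
        state["x"] = state["max"]        # x = max, performed without the lock

    def update(y):
        def step(state):
            if y > state["max"]:         # if (y > max) max = y;
                state["max"] = y
        return step

    def nop(state):
        pass

    # Each thread is a list of (kind, action) steps.
    threads = [
        [("plain", read_max), ("lock", nop), ("plain", update(7)), ("unlock", nop)],
        [("lock", nop), ("plain", update(3)), ("unlock", nop)],
    ]
    outcomes = set()

    def explore(pcs, state, holder):
        if all(pc == len(thread) for pc, thread in zip(pcs, threads)):
            outcomes.add((state["x"], state["max"]))
            return
        for tid, thread in enumerate(threads):
            pc = pcs[tid]
            if pc == len(thread):
                continue                 # this thread has finished
            kind, action = thread[pc]
            if kind == "lock" and holder is not None:
                continue                 # lock A is busy, so this thread is blocked
            new_state = dict(state)
            action(new_state)
            if kind == "lock":
                new_holder = tid
            elif kind == "unlock":
                new_holder = None
            else:
                new_holder = holder
            new_pcs = pcs[:tid] + (pc + 1,) + pcs[tid + 1:]
            explore(new_pcs, new_state, new_holder)

    explore((0, 0), {"x": None, "max": 0}, None)
    return outcomes


print(sorted(figure_7_3_outcomes()))     # [(0, 7), (3, 7)]
```

The final value of max is always 7, but x depends on whether the unlocked read happens before or after the write in GetMax2.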

There is another reason that data races may not appear as dag races that is not 
due to either of the aforementioned anomalies. The reason is that some code may 
never be executed, as discussed in Chapter 6. Figure 7-4 shows a simple example. 
The dag in Figure 7-4, which has no dag races, is generated by an execution where 
WriteX1 obtains lock A before WriteX2. Yet clearly, if the opposite occurred, there 
would later be a race on the variable y. 

In many cases, therefore, dag races are not the same as data races. Since the
Nondeterminator-2 reports dag races, its reports will not exactly correspond to data
races. When the computation has a dag race that is not actually a data race, the
Nondeterminator-2 will report a “false positive.” When the program has a data race
that does not appear as a dag race in the computation, the Nondeterminator-2 will
fail to report that race, a “false negative.”

The Nondeterminator-2 detects dag races because, intuitively, they may sometimes
be the same as data races, as the dag is an approximation to the semantics of
the program. We would therefore like to answer the question: When is it that dag
races actually correspond to data races?
int x;
int y;
Cilk_lockvar A;

cilk int main()
{
x0:     x = 0;
x1:     y = 0;
x2:     Cilk_lock_init(A);
        spawn WriteX1();
        spawn WriteX2();
        sync;
x3:     if (x == 1)
        {
            spawn RaceY(3);
            spawn RaceY(4);
            sync;
        }
x4:     printf("%d", y);
        return 0;
}

cilk void WriteX1()
{
x5:     Cilk_lock(A);
x6:     x = 1;
x7:     Cilk_unlock(A);
}

cilk void WriteX2()
{
x8:     Cilk_lock(A);
x9:     x = 2;
x10:    Cilk_unlock(A);
}

cilk void RaceY(int z)
{
x11:    y = z;
}

Figure 7-4: A program with a data race (on variable y) that may not appear as a dag
race, because the code that exhibits the race may not be executed. Here we show the
dag generated when the lock A is obtained first by WriteX1 and then by WriteX2. As the
instantiation x11 appears nowhere in the dag, there is no dag race.
Chapter 8


Abelian Programs 


In this chapter, we define “abelian” programs and prove that a deadlock-free abelian 
program has a data race if and only if every possible generated dag has a dag race.¹
Furthermore, we show that the absence of dag races in a single computation of a 
deadlock-free abelian program implies that the program, when run on the same input, 
is determinate. Thus, the Nondeterminator-2 can verify that a deadlock-free abelian 
program is determinate for a given input. 

In practice, most programs that use locks in any significant way are not abelian. 
The existence of the class of abelian programs is itself interesting, however. This 
class shows that there is in fact a formal relationship between dag races and data 
races. Furthermore, the guarantee that the Nondeterminator-2 provides for abelian 
programs is a somewhat remarkable result, because programs that use locks are
generally “inherently nondeterministic”; that is, they are read-permute nondeterministic.
Nonetheless, abelian read-permute nondeterministic programs can be shown to always
produce the same final machine state.


Abelian programs 


The program in Figure 8-1 is an example of an abelian program. The program is
read-permute nondeterministic, as the updates to x may happen in different orders.
In other words, x has an “intermediate” value that is nondeterministically either 2 or
3, but x always ends with a value of 5.

int x;
Cilk_lockvar A;

cilk int main()
{
    x = 0;
    Cilk_lock_init(A);
    spawn UpdateX1();
    spawn UpdateX2();
    sync;
    printf("%d", x);
    return 0;
}

cilk void UpdateX1()
{
    Cilk_lock(A);
    x += 2;
    Cilk_unlock(A);
}

cilk void UpdateX2()
{
    Cilk_lock(A);
    x += 3;
    Cilk_unlock(A);
}

Figure 8-1: An example of an abelian program. This particular program has no data
races or deadlocks, and so is determinate.

¹ Some of the results in this chapter appear in [6].

The critical sections in the program in Figure 8-1 obey the following strict
definition of commutativity: Two critical sections R_1 and R_2 commute if, beginning with
any reachable program state S, the execution of R_1 followed by R_2 yields the same
state S' as the execution of R_2 followed by R_1; and furthermore, in both execution
orders, each critical section must execute the identical sequence of instructions on
the identical memory locations.² Thus, not only must the program state remain the
same, the same accesses to shared memory must occur, although the values returned
by those accesses may differ.³ The program in Figure 8-1 also exhibits “properly
nested locking.” Locks are properly nested if any thread that acquires a lock A and
then a lock B releases B before releasing A. We say that a program is abelian if any
pair of parallel critical sections that are protected by the same lock commute, and all
locks in the program are properly nested.

² It may be the case that even though R_1 and R_2 are in parallel, they cannot appear adjacent in
any execution, because a lock that is acquired preceding R_1 and released after R_1 is also acquired
by R_2 (or vice versa). Therefore, we require the additional technical condition that the execution
of R_1 followed by any prefix R_2' of R_2 generates for R_2' the same instructions operating on the same
locations as executing R_2' alone. This requirement is used in the proof of deadlock in Appendix A.

³ By requiring that the entire machine state S remain the same, we mean that the private states
of the interpreters that execute R_1 and R_2, in addition to the shared memory M, must be the
same regardless of the execution order of the regions. This requirement implies that any temporary
variables that are used to store intermediate values should be reset at the end of every critical region,
in order to satisfy the commutativity definition. In practice, of course, temporary variables that are
not live at the end of critical regions can be left with nondeterministic values.
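The strict commutativity test can be approximated on a toy memory model. The Python sketch below is an assumption-heavy illustration: the `Memory` class and the helper names are invented for this example, only shared memory is compared (the private interpreter state mentioned in the footnote is ignored), and a single starting state is tried, whereas the definition quantifies over every reachable state. It checks both conditions of the definition: identical final memory, and an identical sequence of accessed locations for each critical section.

```python
class Memory:
    """Shared memory that logs every access as (operation, location)."""

    def __init__(self, values):
        self.values = dict(values)
        self.trace = []

    def read(self, location):
        self.trace.append(("read", location))
        return self.values[location]

    def write(self, location, value):
        self.trace.append(("write", location))
        self.values[location] = value


def run(sections, initial):
    """Run the sections in order; return final memory and per-section traces."""
    memory = Memory(initial)
    traces = {}
    for section in sections:
        start = len(memory.trace)
        section(memory)
        traces[section.__name__] = memory.trace[start:]
    return memory.values, traces


def commute(r1, r2, initial):
    """True if both orders give the same memory and the same access traces."""
    return run([r1, r2], initial) == run([r2, r1], initial)


def add2(m):                      # x += 2, as in UpdateX1 of Figure 8-1
    m.write("x", m.read("x") + 2)


def add3(m):                      # x += 3, as in UpdateX2 of Figure 8-1
    m.write("x", m.read("x") + 3)


def set1(m):                      # x = 1
    m.write("x", 1)


def set2(m):                      # x = 2
    m.write("x", 2)


def incr_then_maybe_write_y(m):   # x++; if (x == 1) y = 3;
    m.write("x", m.read("x") + 1)
    if m.read("x") == 1:
        m.write("y", 3)


def incr(m):                      # x++
    m.write("x", m.read("x") + 1)


print(commute(add2, add3, {"x": 0}))                             # True
print(commute(set1, set2, {"x": 0}))                             # False
print(commute(incr_then_maybe_write_y, incr, {"x": 0, "y": 0}))  # False
```

The last pair fails on both counts: the final value of y differs between the two orders, and the write to y appears in only one of the traces.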

The idea that critical sections should commute is natural. A programmer presum- 
ably locks two critical sections with the same lock not only because he intends them 
to be atomic, but because he intends them to “do the same thing” no matter in what 
order they are executed. The programmer’s notion of commutativity is usually less 
restrictive, however, than what our definition allows. First, both execution orders 
of two critical sections may produce distinct program states that the programmer 
nevertheless views as equivalent. Our definition insists that the program states be 
identical. Second, even if they leave identical program states, the two execution or- 
ders may cause different memory locations to be accessed. Our definition demands 
that the same memory locations be accessed. 

In practice, therefore, most programs are not abelian, but abelian programs nev- 
ertheless form a nontrivial class of nondeterministic programs that can be checked for 
determinacy. For example, programs that use locking to accumulate values atomi- 
cally, such as a histogram program, fall into this class. Additionally, all programs that 
don’t use locks at all are abelian. Although abelian programs form an arguably small 
class in practice, the algorithms that we present in this thesis can provide guaran- 
tees of determinacy for abelian programs that are not provided by any other existing 
race-detectors for any class of lock-employing programs. 
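As an illustration of the histogram case, the following Python sketch accumulates counts under a single lock. It is a stand-in written for this example rather than Cilk code: `threading.Lock` plays the role of a Cilk lock, ordinary threads play the role of spawned procedures, and the function name and the binning rule are assumptions. Each critical section increments one bin whose index does not depend on the order of execution, so any two critical sections commute in the strict sense used here.

```python
import threading


def histogram(data, num_bins, num_workers=4):
    """Count data values into num_bins bins using lock-protected increments."""
    bins = [0] * num_bins
    lock = threading.Lock()

    def worker(chunk):
        for value in chunk:
            with lock:                      # critical section: one atomic increment
                bins[value % num_bins] += 1

    # Deal the data out to the workers in round-robin fashion.
    chunks = [data[i::num_workers] for i in range(num_workers)]
    threads = [threading.Thread(target=worker, args=(chunk,)) for chunk in chunks]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return bins


print(histogram(list(range(100)), 5))       # [20, 20, 20, 20, 20]
```

The intermediate contents of the bins depend on the thread schedule, but the final histogram does not.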

The converse of the determinacy guarantee is not true. That is, a program may
have a data race, but later deterministically overwrite that value, resulting in a
deterministic final memory state. Also, once a dag race is found, then later parts of the
dag may once again exhibit the forced memory location or forced program counter
anomalies. The guarantee, therefore, is that any computation dag of a deadlock-free
abelian program at least contains a dag race corresponding to the “first” data race
of the program (if a data race exists at all).


Proof of the Nondeterminator-2’s determinacy guarantee 


The proof of the determinacy guarantee centers around “regions” of instantiations,
which are sequences of instantiations executed by a single interpreter. Precisely, a
region is either a single instantiation other than a LOCK or UNLOCK instruction,
or a sequence of instantiations that comprise a critical section (including the LOCK
and UNLOCK instantiations themselves).⁴ Every instantiation belongs to at least one
region and may belong to many. Since a region is a sequence of instantiations, it is
determined by a particular execution of the program and not by the program code
alone. We define the nesting count of a region R to be the maximum number of
locks that are acquired in R and held simultaneously at some point in R.
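The nesting count can be computed in one pass over a region. The Python sketch below assumes a hypothetical encoding in which a region is a list of (kind, argument) pairs and every lock acquired in the region is also released in it; the function name is likewise an assumption.

```python
def nesting_count(region):
    """Maximum number of locks acquired in the region and held at once."""
    held = 0
    deepest = 0
    for kind, _ in region:
        if kind == "LOCK":
            held += 1
            deepest = max(deepest, held)
        elif kind == "UNLOCK":
            held -= 1
    return deepest


single = [("WRITE", "x")]
nested = [("LOCK", "A"), ("LOCK", "B"), ("WRITE", "x"),
          ("UNLOCK", "B"), ("UNLOCK", "A")]
siblings = [("LOCK", "A"), ("LOCK", "B"), ("UNLOCK", "B"),
            ("LOCK", "C"), ("UNLOCK", "C"), ("UNLOCK", "A")]

print(nesting_count(single), nesting_count(nested), nesting_count(siblings))  # 0 2 2
```

A single non-lock instantiation has nesting count 0, which is the base case of the induction in the reordering lemma.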

Since we are only concerned with the final memory states of feasible schedulings, 
we define two legal schedulings of G to be equivalent if both are infeasible, or 
both are feasible and have the same final memory state. An alternate definition of
commutativity, then, is that two regions R_1 and R_2 commute if, beginning with any
reachable machine state S, the instantiation sequences R_1 R_2 and R_2 R_1 are equivalent.

The proof of the equivalence of dag race freedom and final-state determinism 
proceeds as follows. Starting with a dag-race free, deadlock-free computation G 
resulting from the execution of an abelian program, we first prove that adjacent 
regions in a legal scheduling of G can be commuted. Second, we show that regions 
that are spread out in a legal scheduling of G can be grouped together. Third, we 
prove that all legal schedulings of G are feasible and yield the same final memory 
state. Finally, we prove that all executions of the abelian program generate the same
computation and hence the same final memory state.


Lemma 11 (Reordering) Let G be a dag-race free, deadlock-free computation
resulting from the execution of an abelian program. Let X be some legal scheduling of
G. If regions R_1 and R_2 appear adjacent in X, i.e., X = X_1 R_1 R_2 X_2, and R_1 ∥ R_2,
then the two schedulings X_1 R_1 R_2 X_2 and X_1 R_2 R_1 X_2 are equivalent.

Proof: We prove the lemma by double induction on the nesting count of the regions.
Our inductive hypothesis is the lemma as stated for regions R_1 of nesting count i
and regions R_2 of nesting count j.

⁴ The instantiations within a critical section must be serially related in the dag, as we disallow
parallel control constructs while locks are held.
Base case: i = 0. Then R_1 is a single instantiation. Since R_1 and R_2 are adjacent
in X and are parallel, no instantiation of R_2 can be guarded by a lock that guards R_1,
because any lock held at R_1 is not released until after R_2. Therefore, since G is
dag-race free, either R_1 and R_2 access different memory locations or R_1 is a READ and R_2
does not write to the location read by R_1. In either case, the instantiations of each of
R_1 and R_2 do not affect the behavior of the other, so they can be executed in either
order without affecting the final memory state.

Base case: j = 0. Symmetric with above.

Inductive step: In general, R_1 has nesting count i ≥ 1, and is of the form
LOCK(A) ... UNLOCK(A). R_2 of count j ≥ 1 has the form LOCK(B) ... UNLOCK(B).
If A = B, then R_1 and R_2 commute by the definition of abelian. Otherwise, there are
three possible cases.

Case 1: Lock A appears in R_2, and lock B appears in R_1. This situation cannot
occur, because it implies that G is not deadlock free, a contradiction. To construct a
deadlock scheduling, we schedule X_1 followed by the instantiations of R_1 up to (but
not including) the first LOCK(B). Then, we schedule the instantiations of R_2 until a
deadlock is reached, which must occur, since R_2 contains a LOCK(A) (although the
deadlock may occur before this instantiation is reached).

Case 2: Lock A does not appear in R_2. We start with the sequence X_1 R_1 R_2 X_2
and commute pieces of R_1 one at a time with R_2: first, the instantiation UNLOCK(A),
then the (immediate) subregions of R_1, and finally the instantiation LOCK(A). The
instantiations LOCK(A) and UNLOCK(A) commute with R_2, because A does not appear
anywhere in R_2. Each subregion of R_1 commutes with R_2 by the inductive hypothesis,
because each subregion has lower nesting count than R_1. After commuting all of R_1
past R_2, we have an equivalent execution X_1 R_2 R_1 X_2.

Case 3: Lock B does not appear in R_1. Symmetric to Case 2. □


Lemma 12 (Region grouping) Let G be a dag-race free, deadlock-free computation
resulting from the execution of an abelian program. Let X be some legal scheduling
of G. Then, there exists an equivalent scheduling X' of G in which the instantiations
of every region are contiguous.


Proof: We create X' by grouping the regions in X one at a time. Each grouping 
operation does not destroy the grouping of already grouped regions, so eventually all 
regions are grouped. 

Let R be a noncontiguous region in X that completely overlaps no other noncon- 
tiguous regions in X. Since region R is noncontiguous, other regions parallel with R 
must overlap R in X. We first remove all overlapping regions that have exactly one 
endpoint (an endpoint is the bounding LOCK or UNLOCK of a region) in R, where by 
“in” R, we mean appearing in X between the endpoints of R. We shall show how 
to remove regions that have only their UNLOCK in R. The technique for removing 
regions with only their LOCK in R is symmetric. 

Consider the partially overlapping region S with the leftmost UNLOCK in R. Then
all subregions of S that have any instantiations inside R are completely inside R and
are therefore contiguous. We remove S by moving each of its (immediate) subregions
in R to just left of R using commuting operations. Let S_1 be the leftmost subregion
of S that is also in R. We can commute S_1 with every instruction I to its left until it
is just past the start of R. There are three cases for the type of instruction I. If I is
not a LOCK or UNLOCK, it commutes with S_1 by Lemma 11 because it is a region in
parallel with S_1. If I = LOCK(B) for some lock B, then S_1 commutes with I, because
S_1 cannot contain LOCK(B) or UNLOCK(B). If I = UNLOCK(B), then there must exist
a matching LOCK(B) inside R, because S is chosen to be the region with the leftmost
UNLOCK without a matching LOCK. Since there is a matching LOCK in R, the region
defined by the LOCK/UNLOCK pair must be contiguous by the choice of R. Therefore,
we can commute S_1 with this whole region at once using Lemma 11.

We can continue to commute S_1 to the left until it is just before the start of R.
Repeat for all other subregions of S, left to right. Finally, the UNLOCK at the end of
S can be moved to just before R, because no other LOCK or UNLOCK of that same
lock appears in R up to that UNLOCK.

Repeat this process for each region overlapping R that has only an UNLOCK in R.
Then, remove all regions that have only their LOCK in R by pushing them to just
after R using similar techniques. Finally, when there are no more unmatched LOCK
or UNLOCK instantiations in R, we can remove any remaining overlapping regions by
pushing them in either direction to just before or just after R. The region R is now
contiguous.

Repeating for each region, we obtain an execution X' equivalent to X in which
each region is contiguous. □


Lemma 13 Let G be a dag-race free, deadlock-free computation resulting from the
execution of an abelian program. Then every legal scheduling of G is feasible and
yields the same final memory state.


Proof: Let X be the execution that generates G. Then X is a feasible scheduling 
of G. We wish to show that any legal scheduling Y of G is feasible. We shall 
construct a set of equivalent schedulings of G that contain the schedulings X and Y, 
thus proving the lemma. 

We construct this set using Lemma 12. Let X' and Y' be the schedulings of
G with contiguous regions that are obtained by applying Lemma 12 to X and Y,
respectively. From X' and Y', we can commute whole regions using Lemma 11 to put
their threads in the serial depth-first order specified by G, obtaining schedulings X''
and Y''. We have X'' = Y'', because a computation has only one serial depth-first
scheduling. Thus, all schedulings X, X', X'' = Y'', Y', and Y are equivalent. Since
X is a feasible scheduling, so is Y, and both have the same final memory state. □


Lemma 14 Let G be a dag-race free, deadlock-free computation resulting from the
execution of an abelian program. Then every machine execution of the program (on
the same input) generates the same dag G.

Proof: Let X be the original machine execution that generated G. Let Y be an
arbitrary execution of the same program. Let H be the computation generated by Y,
and let H_i be the prefix of H that is generated by the first i instantiations of Y. If
H_i is a prefix of G for all i, then H = G, proving the lemma. Otherwise, assume for
contradiction that i_0 is the largest value of i for which H_i is a prefix of G. Suppose
that the (i_0 + 1)st instantiation of Y is executed by an interpreter with name η. We
shall derive a contradiction through the creation of a new legal scheduling Z of G.
We construct Z by starting with the first i_0 instantiations of Y, and next adding the
successor of H_{i_0} in G that is executed by interpreter η. We then complete Z by adding,
one by one, any nonblocked instantiation from the remaining portion of G. One such
instantiation always exists because G is deadlock free. By Lemma 13, the scheduling
Z that results is a feasible scheduling of G. We thus have two feasible schedulings that
are identical in the first i_0 instantiations but that differ in the (i_0 + 1)st instantiation.
In both schedulings the (i_0 + 1)st instantiation is executed by interpreter η. But, the
state of the machine is the same in both Y and Z after the first i_0 instantiations,
which means that the (i_0 + 1)st instantiation must be the same for both, which is a
contradiction. □


Theorem 15 An abelian Cilk program that produces a deadlock-free computation with
no dag races is determinate.

Proof: Combine Lemma 13 and Lemma 14.


Theorem 16 A deadlock-free computation produced by an abelian Cilk program has
a dag race if and only if the program has a data race.


Proof: (⇐) If a deadlock-free computation has no dag races, then from Lemma 14,
every machine execution generates the same dag, so every such execution is a
scheduling of that dag. Thus, if two machine executions have a data race between them, then
there is also a dag race between them, which is a contradiction.

(⇒) Let G be a deadlock-free computation of an abelian program with a dag race
that is generated by an execution X of the program. Say that the dag race occurs
between instantiations x and y. Let Z = Z_1 x y Z_2 be a legal scheduling of G in which
x and y occur adjacently. (Such a scheduling must exist, because x and y are in
parallel and have no locks in common by definition of dag race, and the computation
has no deadlocks.)
We attempt to use the techniques of Lemma 13 to commute X into the form
of Z. That is, we commute X into the depth-first scheduling and then commute that
into Z. We show that each step either succeeds or yields a feasible data race.

If all steps succeed, then Z is feasible. Since x and y are logically in parallel, they
execute on different interpreters. Moreover, since the instantiation of an instruction
depends only on the private state of its interpreter, changing the order of execution
of Z(x) and Z(y) does not affect the instantiations of either of those instructions.
Therefore, Z' = Z_1 y x is a feasible partial execution of the program, and so the
program has a data race between executions Z and Z'.

The only place where the technique of Lemma 13 can fail is in the base case
of the proof of Lemma 11, as that is the only portion that depends on dag-race
freedom. In that case, we have a scheduling X_1 R_1 R_2 X_2 that is equivalent to the
original execution X (and so is feasible), with R_1 a single instantiation, say x', and
R_1 ∥ R_2. The regions R_1 and R_2 can successfully be commuted unless x' writes a
location accessed by R_2 or x' reads a location written by R_2. Let y' be the first such
conflicting instantiation in R_2. Then, we can commute x' until it is adjacent with
y', yielding a feasible scheduling X_1' x' y' X_2'. By the same argument as above, X_1' y' x'
is also a feasible (partial) execution of the program, and so the program has a data
race. A symmetric argument can be made when R_2 consists of a single instantiation
by choosing the last conflicting instantiation in R_1. □


All of the results so far have assumed a deadlock-free computation, but in gen- 
eral, a deadlock-free computation is not equivalent to a deadlock-free program. For- 
tunately, the following lemma shows that the programmer does not need to worry 
about this distinction when applying Theorems 15 and 16. The proof of this lemma 
is complicated and so is left to Appendix A. 


Lemma 17 Let G be a dag-race free computation generated by an abelian program.
G is deadlock free if and only if the program is deadlock free (on the same input). □
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Corollary 18 Jf the ALL-SETS algorithm detects no data races in an execution of a 
deadlock-free abelian Cilk program, then the program running on the same input has 


no data races and is determinate. 


Proof: Combine Theorems 3 and 15 and Lemma 17. : 


Corollary 19 If the BRELLY algorithm detects no violations of the umbrella disci-
pline in an execution of a deadlock-free abelian Cilk program, then the program running
on the same input has no data races and is determinate.


Proof: Combine Theorems 5, 8, and 15 and Lemma 17.
Part III


Using the Nondeterminator-2 


Chapter 9 


Implementation Issues 


In this chapter, we discuss practical issues surrounding the implementation and use 
of the Nondeterminator-2. We explain how to catch dag races involving the dynamic 
memory allocator, and show that memory cannot be recycled without the risk of miss- 
ing races. We provide some heuristics for reducing the number of false reports that the 
Nondeterminator-2 may produce in the case of nonabelian programs, when dag races 
may not really be data races. Finally, we give some timings of the Nondeterminator-2 
on a selection of Cilk codes, which show that the algorithms roughly conform to their 
theoretical bounds in practice. On all of our sample codes, BRELLY is fast enough 
to be used as an interactive debugger, but ALL-SETS sometimes runs too slowly to be
practical.


Dynamic memory allocation 


Cilk provides the routines Cilk_malloc() and Cilk_free() to dynamically allocate 
and deallocate shared memory. (We refer to these as single instructions MALLOC and 
FREE in the Cilk machine model.) The first observation is that these routines may in 
fact be involved in data races. For example, a read of *x occurring in parallel with 
the command Cilk_free(x) constitutes a race. If the FREE instruction happens first, 
then the value in *x might become garbage before it is read.¹


¹ The semantics of FREE allow it to write garbage into memory, although in practice the memory
changes only after it is allocated again.
The solution is to treat the FREE instruction like a write when it occurs. In other
words, when a FREE occurs, it is compared with all the past accesses in the shadow 
space to check for races. After the FREE occurs, any later access to the freed memory 
(other than a MALLOC) is an error, regardless of whether the access is a serial or 
parallel access. The FREE instruction therefore puts the special tag FREE_ID into the
shadow space. If a later access observes this value, a bug is reported.


Once memory is freed, later accesses to it are always incorrect, regardless of which 
locks are held. Therefore, after memory is freed, the history of which locks have been 
used to access that memory is no longer needed. The Nondeterminator-2’s internal 
memory, which is used to store that information, can thus be deallocated as well. This 
approach maintains the convenient property that the internal memory for storing lock
sets is only allocated as long as the user’s memory is allocated.


When memory is about to be allocated via a MALLOC statement, the shadow space
contains FREE_ID. (The memory allocator is trusted to be correct.) MALLOC simply
overwrites FREE_ID with the ID of the thread in which it is running. In this way, false
positives are not reported when memory is reused. For example, consider two threads
$e_1 \parallel e_2$:

    Thread $e_1$            Thread $e_2$
    *x = 5;                 y = Cilk_malloc(...);
    Cilk_free(x);           *y = 6;


Even though $e_1 \parallel e_2$, it is possible that in a particular execution, $e_1$ runs before
$e_2$, and that the address returned from Cilk_malloc() and assigned to y is the same
as the address contained in x. It would therefore appear that there is a dag race
between the two writes to that address, *x = 5 and *y = 6, as those two writes are
logically in parallel. The writes do not actually constitute a race, however, because
if the Cilk_malloc() statement were executed before the Cilk_free(x) statement,
the memory allocator would ensure that Cilk_malloc() returns an address
different from x. The protocol for the Nondeterminator-2 we have described handles
this situation correctly, because the Cilk_malloc() statement puts the ID for thread
$e_2$ in the shadow space, and so the *y = 6 is a serial access and does not appear to
be a race with the write to *x.

This approach also catches races involving the MALLOC statement itself. That 
is, by writing the current thread id into newly allocated memory, dag races can be 
caught if that memory is written in parallel. 
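
The allocator protocol described above can be sketched in C as follows. This is only an illustration under stated assumptions, not the Nondeterminator-2's actual code: the shadow space is a small array indexed by a word address, the names `shadow_malloc`, `shadow_write`, and `shadow_free` are hypothetical, lock sets are omitted, and the caller-supplied predicate stands in for the tool's test of whether two thread IDs are logically parallel.

```c
#include <assert.h>
#include <stddef.h>

#define SHADOW_WORDS 64
#define FREE_ID (-1)   /* special tag: the word is currently freed */

enum access_result { ACCESS_OK = 0, ACCESS_RACE = 1, ACCESS_FREED = 2 };

/* Hypothetical stand-in for the tool's test of logical parallelism. */
typedef int (*parallel_fn)(int earlier_thread, int later_thread);

/* Toy predicate for experiments: any two distinct IDs count as parallel. */
static int distinct_ids_parallel(int earlier_thread, int later_thread) {
    return earlier_thread != later_thread;
}

static int shadow[SHADOW_WORDS];   /* ID of the last writer of each word */

/* Nothing is allocated at first, so every word carries FREE_ID. */
static void shadow_init(void) {
    for (size_t i = 0; i < SHADOW_WORDS; i++) shadow[i] = FREE_ID;
}

/* MALLOC: overwrite FREE_ID with the ID of the allocating thread. */
static void shadow_malloc(size_t addr, size_t n, int thread) {
    assert(addr + n <= SHADOW_WORDS);
    for (size_t i = addr; i < addr + n; i++) shadow[i] = thread;
}

/* A write: touching freed memory is always an error; otherwise the
   write is compared with the previous writer recorded in the shadow. */
static enum access_result shadow_write(size_t addr, int thread,
                                       parallel_fn parallel) {
    assert(addr < SHADOW_WORDS);
    if (shadow[addr] == FREE_ID) return ACCESS_FREED;
    enum access_result r =
        parallel(shadow[addr], thread) ? ACCESS_RACE : ACCESS_OK;
    shadow[addr] = thread;
    return r;
}

/* FREE: checked like a write, then the words are tagged FREE_ID. */
static enum access_result shadow_free(size_t addr, size_t n, int thread,
                                      parallel_fn parallel) {
    enum access_result first_problem = ACCESS_OK;
    assert(addr + n <= SHADOW_WORDS);
    for (size_t i = addr; i < addr + n; i++) {
        enum access_result r = shadow_write(i, thread, parallel);
        if (r != ACCESS_OK && first_problem == ACCESS_OK) first_problem = r;
        shadow[i] = FREE_ID;
    }
    return first_problem;
}
```

In this sketch, a write by a thread to memory that the same thread has just allocated is a serial access, which mirrors why the write *y = 6 in the example is not reported.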

There is, however, a problem with the approach we have described. Consider the
following example, which is similar to the one above, but in which $e_2$ writes *x rather
than *y.

    Thread $e_1$            Thread $e_2$
    *x = 5;                 y = Cilk_malloc(...);
    Cilk_free(x);           *x = 6;


In this example, there is always a race between the writes to *x. This race may 
be missed, however, if the Cilk_malloc() statement returns the same address as x. 
Then the *x = 6 appears to be writing newly allocated memory, rather than writing 
to the data pointed to by x. 

In order to distinguish this case from the previous one, we need some way to dis- 
tinguish between writing to “*x” and “*y” even though both end up writing the same 
memory location. Making this distinction requires some understanding of the mean- 
ing of the program, rather than just monitoring of the memory locations accessed. 
This sort of alias analysis is typically very difficult, and we do not attempt to do it. 

Rather, our solution is very simple. The Nondeterminator-2 does not recycle 
memory. That is, when the memory allocator is run in debugging mode, it assures 
that Cilk_malloc() never returns memory that has previously been allocated. When 
memory is freed, it is simply left in the free state forever. In this way, memory is 
never aliased, and the problem in the previous example cannot occur. 
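
A debugging allocator with this property can be sketched as a simple bump allocator. The code below is a hypothetical illustration, not the Cilk runtime's allocator: `debug_malloc`, `debug_free`, and the fixed-size arena are assumed names and choices made only to show that an address, once handed out, is never handed out again.

```c
#include <assert.h>
#include <stddef.h>

#define ARENA_BYTES 4096

/* Hypothetical debugging arena. The union members other than `bytes`
   exist only to give the storage a generous alignment. */
static union {
    unsigned char bytes[ARENA_BYTES];
    long double align_ld;
    long long align_ll;
    void *align_ptr;
} arena;

static size_t next_unused = 0;   /* everything below this offset was handed out */

/* Hand out fresh memory only. Returns NULL when the arena is exhausted. */
static void *debug_malloc(size_t n) {
    if (n == 0) n = 1;                          /* distinct address per call */
    size_t rounded = (n + 15u) & ~(size_t)15u;  /* keep offsets multiples of 16 */
    if (rounded < n || rounded > ARENA_BYTES - next_unused) return NULL;
    void *p = arena.bytes + next_unused;
    next_unused += rounded;
    return p;
}

/* Freed memory is never returned to the allocator, so it stays in the
   free state forever and no later allocation can alias it. */
static void debug_free(void *p) {
    (void)p;
}
```

The cost of this design is that the arena only grows, which is the trade-off the text accepts for debugging runs.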

The justification for this approach is that in modern machines, memory (and
virtual address space) is large and cheap. It is acceptable to use a lot of memory
when debugging; memory can still be recycled when the application is in production
mode. If users do not have enough memory, they can simply turn this feature of the
Nondeterminator-2 off, and go back to recycling memory. In that case, however, the
dag race in the last example may be missed.


Reducing false race reports 


As we have seen, some dag races may not correspond to data races if they are arti- 
facts of other races or of noncommutative critical sections. Other researchers have 
attempted to algorithmically identify “first races,” as compared to later artifacts of 
those races [33, 34]. While we do not attempt anything of this magnitude, we do 
implement several tricks that can make the race reports of the Nondeterminator-2 
more manageable for the user. 

The first trick is to avoid reporting the “same” race more than once. When a race 
is reported, we enter all of the involved line numbers and file names into a hash table. 
If we later encounter a race with the same lines in the same files, we don’t report it, 
as it is assumed to be another instance of the same race. This feature is essential
for keeping the number of reported races manageable; without it, a single race,
executed over and over again, could produce thousands of lines of debugging reports.
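
A minimal sketch of such a filter is shown below. It is an assumption-laden illustration rather than the tool's implementation: the function name `should_report`, the fixed table size, and the hash function are all hypothetical, the table stores the file-name pointers it is given (so they must outlive the table, as string literals do), and a race is identified by the ordered pair of source locations.

```c
#include <assert.h>
#include <string.h>

#define TABLE_SLOTS 1024   /* fixed size for the sketch; a real tool would grow it */

struct race_key {
    const char *file_a; int line_a;   /* first access involved in the race  */
    const char *file_b; int line_b;   /* second access involved in the race */
    int used;
};

static struct race_key table[TABLE_SLOTS];

static unsigned long hash_str(const char *s, unsigned long h) {
    while (*s) h = h * 31u + (unsigned char)*s++;
    return h;
}

static unsigned long hash_key(const char *fa, int la, const char *fb, int lb) {
    unsigned long h = hash_str(fa, 17u);
    h = h * 31u + (unsigned long)la;
    h = hash_str(fb, h);
    return h * 31u + (unsigned long)lb;
}

/* Returns 1 the first time a pair of source locations is seen (report it)
   and 0 on later sightings (suppress it). Uses linear probing. */
static int should_report(const char *fa, int la, const char *fb, int lb) {
    unsigned long start = hash_key(fa, la, fb, lb) % TABLE_SLOTS;
    for (unsigned long probe = 0; probe < TABLE_SLOTS; probe++) {
        struct race_key *k = &table[(start + probe) % TABLE_SLOTS];
        if (!k->used) {               /* empty slot: remember this race */
            k->file_a = fa; k->line_a = la;
            k->file_b = fb; k->line_b = lb;
            k->used = 1;
            return 1;
        }
        if (k->line_a == la && k->line_b == lb &&
            strcmp(k->file_a, fa) == 0 && strcmp(k->file_b, fb) == 0)
            return 0;                 /* same lines in the same files */
    }
    return 1;   /* table full: report rather than silently drop */
}
```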

The BRELLY algorithm has an additional problem of reporting multiple races. If 
an unprotected umbrella is discovered, that umbrella may potentially be reported once 
for every access in the umbrella (other than the first one). Rather than reporting all 
of these separately, the BRELLY algorithm should group all the accesses together and 
report them all at once. In some cases, it is possible to determine that some subset 
of the accesses actually constitutes a dag race, and those accesses can be reported in 
preference to the entire umbrella. See [5] for more details. 

When false reports due to infeasible dag races occur, we would like to provide some 
way for the user to inform the Nondeterminator-2 that these races are infeasible, so 
that it can avoid reporting them in future executions. One approach is to allow the 
user to “turn off” the Nondeterminator-2’s memory checking, so that certain memory 
accesses are ignored. User annotation can either be done lexically via a compiler 
pragma or dynamically by setting a global flag. While this approach may reduce
race reports, it requires users to manually assure themselves that there are no races
involving the ignored accesses.

A solution that requires less verification from the user is to use fake locks, that is, locks
that are acquired and released only in debugging mode, as in the implicit R-LOCK
fake lock. The user can then protect accesses involved in infeasible dag races using
a common fake lock. Fake locks reduce the number of false reports made by the
Nondeterminator-2, and they require the user to manually check for data races only
between critical sections locked by the same fake lock.


A particularly common cause of false reports is “publishing.” One thread allocates 
a heap object, initializes it, and then “publishes” it by atomically making a field in a 
global data structure point to the new object so that the object is now available to 
other threads. If a logically parallel thread now accesses the object in parallel through 
the global data structure, an infeasible dag race occurs between the initialization of 
the object and the access after it was published. 

Fake locks do not seem to help much with the publishing problem, because it is 
hard for the initializer to know all the other threads that may later access the object, 
and we do not wish to suppress data races among those later accesses. One possible 
solution is to allow users to explicitly put in PUBLISH statements in the program, 
to declare that memory is being published. The effect of a PUBLISH statement is to 
erase the history of past accesses that is contained in the shadow space. Since parallel 
threads were unable to access the memory up to the point of the PUBLISH statement, 
accesses before that statement cannot be involved in races. 
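
The effect of a PUBLISH statement on the shadow space can be illustrated with the toy model below. Everything here is hypothetical: the source does not give an interface for PUBLISH, so the function `publish(addr, n)`, the `NO_ACCESS` tag, and the simplified rule that any two distinct thread IDs count as logically parallel are assumptions made for the sketch. The explicit size argument reflects the observation that the user may have to supply the size of the object.

```c
#include <assert.h>
#include <stddef.h>

#define WORDS 64
#define NO_ACCESS (-1)   /* hypothetical tag: no access history recorded */

static int last_writer[WORDS];   /* toy shadow space: last writer per word */

static void history_init(void) {
    for (size_t i = 0; i < WORDS; i++) last_writer[i] = NO_ACCESS;
}

/* Toy rule: an access conflicts with a recorded write by a different ID. */
static int conflicts(size_t addr, int thread) {
    assert(addr < WORDS);
    return last_writer[addr] != NO_ACCESS && last_writer[addr] != thread;
}

/* Record a write; returns 1 if it appears to race with the recorded history. */
static int record_write(size_t addr, int thread) {
    int race = conflicts(addr, thread);
    last_writer[addr] = thread;
    return race;
}

/* Check a read against the recorded history without modifying it. */
static int check_read(size_t addr, int thread) {
    return conflicts(addr, thread);
}

/* PUBLISH: erase the history of past accesses for n words starting at addr,
   so that the initialization before publication is no longer compared
   against later accesses. */
static void publish(size_t addr, size_t n) {
    assert(addr + n <= WORDS);
    for (size_t i = addr; i < addr + n; i++) last_writer[i] = NO_ACCESS;
}
```

Without the call to `publish`, a read by another thread after initialization is flagged in this model; after the call, the same read is not.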

There are some practical difficulties in using PUBLISH in C. The size of struc- 
tures may not be known statically, so the user may be required to supply the size. 
Furthermore, there is no way to specify that structures that are nested via pointers 
are all part of the same “object.” The user must therefore explicitly issue a PUBLISH 
statement for each nested pointer data structure. Publishing of objects could be more 
elegantly handled in a strongly-typed language. A possible solution for C is to use 
checkpointing technology, which is able to automatically trace through entire data 
structures [40]. Even then, the semantics of PUBLISH could be difficult to express if
only parts of a data structure are being published.
Timings of the Nondeterminator-2


In this section, we give some experimental measurements of the performance of the 
Nondeterminator-2.² As it is a debugging tool, the Nondeterminator-2 does not need
to achieve absolutely optimal performance. Rather, it just needs to be fast enough 
to use in an interactive debugging environment. 

Our implementations of ALL-SETS and BRELLY have not yet been optimized, 
and so the timings presented here are preliminary; better performance than what we 
report here is likely to be possible. In particular, our current implementation treats 
every READ like an ACCESS with the fake R-LOCK, as described in Chapter 2. This 
approach requires an allocation of a lock set at every read operation. We expect that 
the running time of both algorithms could be greatly improved if we optimized the 
common case of reads with no locks held. 

According to Theorem 4, the factor by which ALL-SETS slows down a program is 
roughly O(Lk) in the worst case, where L is the maximum number of distinct lock sets 
used by the program when accessing any particular location, and k is the maximum 
number of locks held by a thread at one time. According to Theorem 9, the worst-case 
slowdown factor for BRELLY is about O(k). In order to compare our experimental 
results with the theoretical bounds, we characterize our four test programs in terms 
of the parameters k and L:³

maxflow: A maximum-flow code based on Goldberg’s push-relabel method [17]. 
Each vertex in the graph contains a lock. Parallel threads perform simple operations 
asynchronously on graph edges and vertices. To operate on a vertex u, a thread 
acquires u’s lock, and to operate on an edge (u,v), the thread acquires both u’s lock
and v’s lock (making sure not to introduce a deadlock). Thus, for this application, the 
maximum number of locks held by a thread is k = 2, and L is at most the maximum 
degree of any vertex. 

n-body: An n-body gravity simulation using the Barnes-Hut algorithm [1]. In
one phase of the program, parallel threads race to build various parts of an “octtree”
data structure. Each part is protected by an associated lock, and the first thread to
acquire that lock builds that part of the structure. As the program never holds more
than one lock at a time, we have k = L = 1.

bucket: A bucket sort [8, Section 9.4]. Parallel threads acquire the lock associated
with a bucket before adding elements to it. This algorithm is analogous to the typical
way a hash table is accessed in parallel. For this program, we have k = L = 1.

rad: The radiosity application (discussed further in Chapter 10). The code locks a
“patch” of the scene along with the “surface” that the patch is on, so that k = 2, and
L is the maximum number of patches per surface, which increases at each iteration
as the rendering is refined.

² Some of the results in this section appear in [6].
³ These characterizations do not count the implicit fake R-LOCK used by the detection algorithms.

Figure 9-1: Timings of our implementations on a variety of programs and inputs, run
on an UltraSPARC I. (The input parameters are given as sparse/dense and number of
vertices for maxflow, number of bodies for n-body, number of elements for bucket, and
iteration number for rad.) The parameter L is the maximum number of distinct lock sets
used while accessing any particular location, and k is the maximum number of locks held
simultaneously. Running times for the original optimized code, for ALL-SETS, and for
BRELLY are given, as well as the slowdowns of ALL-SETS and BRELLY as compared to the
original running time.


Figure 9-1 shows the preliminary results of our experiments on the test codes. 
These results indicate that the performance of ALL-SETS is indeed dependent on the 
parameter L. Essentially no performance difference exists between ALL-SETS and 
BRELLY when L = 1, but ALL-SETS gets progressively worse as L increases. On all
of our test programs, BRELLY runs fast enough to be useful as a debugging tool. In
some cases, ALL-SETS is as fast, but in other cases, the overhead of ALL-SETS is too
extreme (iteration 13 of rad takes over 3.5 hours) to allow interactive debugging.
Chapter 10


Parallel Radiosity 


In this chapter, we describe our experiences parallelizing a large, real-world radios- 
ity application. We view this application as a case study for the usefulness of the 
Nondeterminator-2. We used the Nondeterminator-2 to minimize the amount of the 
radiosity code that we needed to examine and understand. Figure 10-1 shows the 
speedup of our Cilk code as compared to the original optimized C version. With less 
than 5 percent of the code from the original version changed, the entire application 
achieves a speedup of 5.97 on 8 processors. Furthermore, the Nondeterminator-2 gives
us a high degree of confidence that the code is actually data-race free.


Goals of parallelizing radiosity 


Radiosity is a graphics algorithm for modeling light in diffuse environments. It is an 
irregular application, and therefore the computation is difficult to balance statically 
across processors. That is, the area where the majority of the CPU time is spent 
depends on the input scene, and varies dynamically as the lighting is calculated. In 
order to get good performance on a parallel machine, the CPU time must be balanced 
evenly across all processors, so that all processors are utilized fully. This balancing 
is difficult to do at compile time when the behavior of the computation is difficult to 
predict. 


Cilk provides a dynamic scheduler which balances tasks across processors using
a provably good work-stealing algorithm [4]. Radiosity, then, is a good test for the
capabilities of Cilk’s scheduler. Past attempts at parallelizing radiosity have required
the algorithm to be modified with explicit load balancing [41].

Figure 10-1: Speedup of the rad application on a maze scene as compared to the original
optimized C code. Measurements were done on an 8-processor 167-MHz UltraSPARC I.

In order to effectively test Cilk’s performance, we prefer not to develop our own 
radiosity application. It is somewhat “unfair” to develop such a test by writing code 
that is intentionally designed to work well with Cilk. Also, we would prefer to have 
a known benchmark against which to measure the parallelized code. Speeding up 
our own code by parallelizing it is not convincing, because it might be that serial 
optimizations could perform as well or better. Therefore, a better test is to try to 
parallelize code that was written and optimized by someone else. If we can speed 
up code that has already been optimized by graphics experts, our results clearly 
demonstrate the usefulness of Cilk. 


We therefore downloaded a radiosity application, rad, which was originally written 
by Bekaert, Suykens de Laet, and Dutre at the Katholieke Universiteit Leuven in
Belgium [2]. The application is large, consisting of 75 source files and around 25,000 
lines of code. The application is written in C, and every C program is a legal Cilk 
program, so “porting” the application to Cilk required no effort. 

Correctly parallelizing the code, however, is not as trivial. The code was not 
originally written to be parallelized. Although we might expect certain operations to 
be in principle independent, in fact they may use some shared data structures simply 
because the programmer implemented it that way. Such code would result in data 
races if those operations were executed in parallel. 

Ordinarily, in order to parallelize the code without introducing data races, we 
would have to search through the entire code looking for shared data structures. 
The Nondeterminator-2, however, provides an alternate approach. We simply run 
in parallel those operations that we think should in principle be independent. Then 
we run the code through the Nondeterminator, which points us to the places in the 
code where there is unexpected data sharing. We can fix these problems by copying 
the data, or by adding locks. More importantly, we do not need to examine
code that is not flagged by the Nondeterminator-2 at all; we simply leave it as is. In that
way, we minimize the amount of time we need to spend studying and understanding 
someone else’s code. 

When parallelizing the radiosity application, we took precisely this approach of 
immediately depending on the Nondeterminator-2, although we actually began by 
using the original Nondeterminator and not the Nondeterminator-2. This particular 
application was actually developed in conjunction with the Nondeterminator-2, and 
served as the inspiration for many of that tool’s features. We now illustrate some of
the details of our effort in order to give a more concrete sense of what was involved.


The parallelization effort 


The first step was to gain an understanding of the underlying radiosity algorithm, so
we could figure out what to parallelize. Radiosity is a lighting model that is suited for
diffuse environments. Light striking a surface is assumed to undergo an ideal diffuse
reflection, meaning that it scatters equally in all directions. The diffuse reflection
assumption is in contrast to ray tracing’s assumption of specular reflection, wherein
a beam of light is assumed to reflect off a surface in another single beam, with the
angle of reflection equaling the angle of incidence.

As in many graphics algorithms, radiosity divides the scene into several small
“patches.” Each patch $i$ has an associated power per unit area $B_i$ from which the
color of the patch $i$ can be determined.¹ The idea is that the power leaving patch $i$
is the sum of the power emitted by $i$ (if $i$ is a light source) and the power reflecting
off $i$ that comes from all of the other patches in the room. This formula leads to the
following set of linear equations [18]:

$$B_i = E_i + \rho_i \sum_j B_j F_{ij}$$

where $B_i$ is the power/area of patch $i$, $E_i$ is the emitted power/area of patch $i$, $\rho_i$
is the reflectance of patch $i$, and $F_{ij}$ is the formfactor from $i$ to $j$, the fraction of
radiant power leaving $i$ that arrives at $j$.

We can solve for $B_i$ by numerical iteration. The majority of the calculation time,
however, is not spent in the numerical solution, but rather in the calculation of the
formfactors $F_{ij}$. The formfactor from a point patch $i$ to a patch $j$ is the fraction of
$i$’s hemisphere that $j$ occupies. Computing the formfactor $F_{ij}$ thus requires a double
integral over the points of patch $i$ and patch $j$. The formfactors are entirely a property
of the geometry of the scene, and do not depend on lighting.
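
To make the numerical iteration concrete, the sketch below solves the radiosity equations for a tiny scene with precomputed formfactors. It is an illustration only: the source does not say which iterative scheme is used, so Jacobi iteration is an assumption, and the names `solve_radiosity` and `NPATCH` are hypothetical. The iteration converges when the reflectances and formfactors make each row of the system sum to less than one, as in a physically plausible scene.

```c
#include <assert.h>
#include <stddef.h>

#define NPATCH 3   /* number of patches in this toy scene */

/* Jacobi iteration for B_i = E_i + rho_i * sum_j B_j * F_ij.
   Returns the number of sweeps used, or -1 if the largest change in any
   B_i did not fall below tol within max_sweeps sweeps. */
static int solve_radiosity(double E[NPATCH], double rho[NPATCH],
                           double F[NPATCH][NPATCH], double B[NPATCH],
                           double tol, int max_sweeps) {
    double next[NPATCH];

    for (size_t i = 0; i < NPATCH; i++) B[i] = E[i];   /* start from emission */

    for (int sweep = 1; sweep <= max_sweeps; sweep++) {
        double largest_change = 0.0;
        for (size_t i = 0; i < NPATCH; i++) {
            double gathered = 0.0;                      /* sum_j B_j * F_ij */
            for (size_t j = 0; j < NPATCH; j++) gathered += B[j] * F[i][j];
            next[i] = E[i] + rho[i] * gathered;

            double change = next[i] - B[i];
            if (change < 0.0) change = -change;
            if (change > largest_change) largest_change = change;
        }
        for (size_t i = 0; i < NPATCH; i++) B[i] = next[i];
        if (largest_change < tol) return sweep;
    }
    return -1;
}
```

Each sweep costs only a small matrix-vector product once the formfactors are known, which is consistent with the observation that the formfactor calculation, not the numerical solution, dominates the running time.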

As they are calculated directly from the initial geometry of the scene, distinct 
formfactors can be computed in parallel. Since the calculation of formfactors com- 
prises the majority of the execution time of the radiosity algorithm, this parallelization 
should noticeably speed up the entire execution. 

Armed with this knowledge, we searched through the rad code for the calculation
of the formfactors, and ran them in parallel. We then ran the resulting code through
the Nondeterminator to look for data races.

¹ Actually, the color of patches is determined by assigning colors to vertices and then interpolating
those colors to the rest of the patch, typically with Gouraud shading [19].

Since the code was initially serial, it did not contain any locks, and was there- 
fore abelian. One goal of parallelization is to keep the program abelian “as long as 
possible,” which provides the stronger guarantee for the Nondeterminator-2. (This 
strategy often amounts to avoiding introducing locks for as long as possible.) When 
finally forced to make the program nonabelian, the programmer must be sure to think 
about the implementation more carefully. 

The first dag races we ran across in rad involved global variables. In some cases, 
these globals appeared to be used only for convenience, to avoid passing them around 
as arguments to procedures. We modified the code to pass arguments rather than use 
globals whenever possible. Another common use of global variables we found was just 
for statistical purposes, such as timings. These statistics can either be ignored in the 
parallel execution (i.e. allowed to become garbage values), or they can be updated 
atomically through the use of locks. Such atomic updates are commutative, and so 
preserve the abelian property of the program. 

The rad code does not exactly implement the radiosity algorithm as we have 
described it. The code does not precompute all the formfactors and then solve the 
numerical system, as computing all the formfactors would require too much CPU 
time and memory. Rather, the code interleaves the solution to the system with the 
formfactor calculations. Specifically, it chooses a single patch $i$ where the error in the
$B_i$ approximation is the greatest. It then improves the estimate for $B_i$ by improving
its approximation of the formfactors $F_{ij}$ for all other patches $j$. The first few
iterations of the application, shown in Figure 10-2, demonstrate how the code interleaves
the updating of the patch radiosities with the calculations of the formfactors. This
algorithm poses some problems for the parallel execution, because separate iterations
of the numerical solution cannot be run in parallel, as each iteration is dependent on
the previous one. The calculations of the formfactors from $i$ to all the other patches
$j$ can still be parallelized, however.²

² Once again, we could have rewritten the code to perform more formfactor calculations at once,
but then we would have lost the serial optimizations of the original authors.
Figure 10-2: The first three iterations of the rad program on a maze scene. At each
iteration, the program refines its formfactor estimates for the patch where the error is 
greatest. In the first few iterations, the error is greatest near the light sources, so the 
program appears to be “lighting up” the lights one by one. 


The formfactors $F_{ij}$ are stored in a linked list in a data structure for patch $i$.
Thus, we encountered a dag race on the updates to this list, as formfactors were 
being added in parallel to it. Fortunately, the order in which the formfactors occur in 
the list doesn’t matter, so they can be added in parallel as long as the list insertion 
operations are made to be atomic. We added a lock to each patch data structure for 
this purpose. 

This logic causes the program to be nonabelian, as the order of the nodes in the 
linked list depends on the order of execution of critical sections. Nonetheless, it is 
not hard to argue that the Nondeterminator still catches all dag races involving this 
list. The reason is that the code never reads only part of the list; rather, it always 
reads the entire list at once. Thus, if any writes race with those reads, they race with 
the reads regardless of the order of the elements in the list. 

The next difficulty we encountered in our parallelization was patch refinement. 
When the error in the estimate for the formfactor $F_{ij}$ is deemed to be too great,
either patch $i$ or $j$ is refined. (The patch refined is the one with the greater surface
area.) Refinement means that the patch is subdivided into smaller patches in order 
to get more accurate radiosity estimates. 

If two parallel threads attempt to subdivide a patch $i$ at once, a data race occurs
on that patch. It is allowable, however, for either thread to do the actual refinement,
as long as the refinement is done only once. This logic can be implemented with
locks. The first thread to acquire the “refinement lock” for the patch performs the
subdivision, and the second thread waits on that lock. After the first thread finishes
the refinement and releases the lock, the second thread acquires the lock but discovers
that the refinement has already been done, and so does not repeat it. Also, refining
patch $i$ does not destroy $i$; it merely creates “subpatches” of $i$. Therefore, a thread
$e_1$ that refines a patch $i$ does not interfere with a parallel thread $e_2$ that calculates
directly with $i$. We can thus have many parallel threads calculating formfactors for
patch $i$, some of which refine $i$ and some of which do not, without any data races.

This protocol, unfortunately, is entirely nonabelian. A single thread creates and
initializes the subpatches of $i$. Many parallel threads read these subpatches, resulting
in dag races. These dag races are not actually data races, because the locking protocol
ensures that no threads read the subpatches until the “first” thread finishes initializing
them. This protocol is an example of the “publishing” problem discussed in Chapter 9, 
and false race reports for this protocol can be avoided by annotating the code with 
PUBLISH statements. 

When a patch is refined, the newly created vertices are added to a list stored in 
the patch’s “surface.” A surface is the top-level patch that initially gets created from 
the scene description. Multiple patches can thus have the same surface, so we need to 
create a lock for each surface that is acquired when vertices are added to it. Adding 
vertices to a surface, therefore, is similar to adding formfactors to a patch. 

When we initially ran the code, it appeared to be behaving correctly. We later 
observed, however, that the code was behaving nondeterministically after running for 
about 10 iterations. Investigation of this problem showed that its manifestation was 
that a thread would read from freed memory. This discovery led us to think about the 
mechanism for detecting races with the memory allocator, as discussed in Chapter 9, 
although that turned out not to be the problem in this particular case. 

We had been running the code through the Nondeterminator-2 for only one iter- 
ation, expecting that all dag races would show up there. We were also at that point 
struggling with a large number of false reports (dag races that were not actually data 
races). This difficulty led us to the idea of a hash table to avoid reporting the “same”
race twice, as discussed in Chapter 9. After implementing that idea, we ran the
debugging code for many iterations. Many false reports still showed up in the first
iteration, but those were not reported again, so that later iterations did not report 
races. The first new race report appeared in iteration 10, and this report pointed us 
to the bug that we had seen. 

At this point, we were able to obtain a reasonable speedup, but we discovered 
that the serial code for patch refinement was taking a lot of the execution time. The 
expensive part of this code is that in order to avoid adding duplicate vertices to the 
surface’s list of vertices, the program must search that list before adding each vertex. 
This search can be parallelized, because searching only requires reading the elements 
of the list, not writing them. 

This parallelization requires an elaborate protocol, which is described in Figure 10-3.
We first obtain the lock for the list, and record the head pointer of the list. We
then release the lock and search the rest of the list for the vertex in question. If the 
vertex is found, then we don’t need to add it to the list, so we’re finished. If the 
vertex is not found, then we acquire the lock for the list again in order to add it. 
Other vertices, however, may have been added to the list since the time we began 
our search. We thus must search the beginning of the list, up to the point where we 
began our earlier search, in order to check if a parallel thread has already added the 
vertex in question. If not, then we add the vertex to the front of the list while still 
holding the lock for the list. The idea of this protocol is that the majority of the 
computation time is spent searching the bulk of the list with no locks held, which can 
be done by many threads in parallel.³
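
The protocol can be sketched in C as follows. This is a serial illustration under stated assumptions, not the code from rad: the names `add_vertex`, `list_lock`, and `list_unlock` are hypothetical, the lock hooks only check that acquire and release are paired (a Cilk port would acquire and release the real lock for the list at those points), and vertices are identified by an integer ID. The sketch relies on nodes being added only at the front of the list and never removed, so the node recorded in `save_head` remains reachable from the current head.

```c
#include <assert.h>
#include <stdlib.h>

struct vnode { int id; struct vnode *next; };
struct vlist { struct vnode *head; int locked; };

/* Hypothetical lock hooks: serial stand-ins that only check pairing. */
static void list_lock(struct vlist *l)   { assert(!l->locked); l->locked = 1; }
static void list_unlock(struct vlist *l) { assert(l->locked);  l->locked = 0; }

/* Search the half-open range [from, until) of the list for id. */
static int found_between(const struct vnode *from, const struct vnode *until,
                         int id) {
    for (; from != until; from = from->next)
        if (from->id == id) return 1;
    return 0;
}

/* Returns 1 if the vertex was added, 0 if it was already present,
   and -1 if memory for the new node could not be allocated. */
static int add_vertex(struct vlist *l, int id) {
    /* Step 1: record the current head while holding the lock. */
    list_lock(l);
    struct vnode *save_head = l->head;
    list_unlock(l);

    /* Step 2: search the bulk of the list with no lock held. */
    if (found_between(save_head, NULL, id)) return 0;

    /* Step 3: re-acquire the lock and search only the nodes added since
       step 1, that is, from the current head up to save_head. */
    int result = 0;
    list_lock(l);
    if (!found_between(l->head, save_head, id)) {
        struct vnode *n = malloc(sizeof *n);
        if (n == NULL) {
            result = -1;
        } else {
            n->id = id;
            n->next = l->head;   /* add to the front of the list */
            l->head = n;
            result = 1;
        }
    }
    list_unlock(l);
    return result;
}
```

The design choice is that the locked region in step 3 covers only the short prefix of recently added nodes, while the long search in step 2 runs unlocked.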

This protocol is once again nonabelian. When vertices are added to the list, they 
are being “published.” False races can thus be avoided by judicious use of PUBLISH 
statements. 

As mentioned above, a “refinement lock” is acquired when a patch is refined.
The parallelism within patch refinement, therefore, actually occurs while a lock is
held. This behavior is in theory disallowed, but in reality it does not cause any fatal
problems. The only danger is that the Nondeterminator-2 may miss races because
the refinement lock appears to be protecting the parallel accesses when in fact it does
not. In this particular case, no races are missed, because there are parallel threads
that operate on different patches but the same surface; these parallel accesses are not
protected by any single patch refinement lock.

³ This code could likely be improved by using a more efficient data structure than a linked list,
but we do not wish to change the underlying algorithms of the original implementation.

Step 1. While holding the lock for the list, record the current head pointer of the list
into a local variable SAVE_HEAD.

Step 2. Search the list starting at SAVE_HEAD for the particular node in question. If
the node is not found, go to step 3; otherwise, nothing else need be done. The lock
for the list is not held during this step, so many searches may occur in parallel.

Step 3. Acquire the lock for the list again. Search for the node in question from the
current head pointer of the list until the node saved in SAVE_HEAD. If the node is not
found, add it to the front of the list.

Figure 10-3: The protocol for adding vertices to the surface’s vertex list. Most of the
searching of the list can be done in parallel.

When we ran the rad application on many processors, we discovered that the form- 
factor calculations and patch refinement were sufficiently fast that other portions of 
the code were becoming bottlenecks. We parallelized two other CPU intensive rou- 
tines. One of these routines calculates the color of vertices from the patch radiosities; 
this color calculation can be done for the entire scene in parallel. The other routine 
performs “T-vertex elimination,” which essentially deallocates memory for certain
undesirable kinds of vertices.


Parallelization results 


Timings of the rad routines are given in Figure 10-4. As expected, the formfac- 
tor/patch refinement calculations dominate execution time in the one processor exe- 
cution. Vertex color computation and T-vertex elimination also comprise a sizeable 
portion of the execution time. The rest of the CPU time is labeled “Other,” and 
corresponds to the remaining code, which was not parallelized. This code includes, 
for example, the numerical iterations updating the radiosity values and the hardware 
rendering of the scene to the monitor. As the parallel routines speed up in multipro- 
cessor execution, the formfactor calculation with patch refinement is still the most 
expensive operation, but the time spent in nonparallelized code becomes comparable. 

Figure 10-5 shows these measurements as speedups compared to the original
optimized C code. In particular, we observe that the one-processor Cilk version
is negligibly slower than the original C version.⁴ The speedup curve for the entire
parallelized application shows the combination of the running times of the four
components given in Figure 10-4. The entire execution achieves a 5.97-fold speedup on
8 processors.

⁴The added overhead of Cilk procedure calls is balanced by the speedup from Cilk’s fast memory
allocator.

Figure 10-4: Running times of the components of the rad application. Timings were
taken over 100 iterations of the application on the maze scene on an 8-processor 167-MHz
UltraSPARC I.


Additionally, Cilk provides a way to measure the work and critical path of the
computation. The work $T_1$ is the time it takes the Cilk program to execute on a
single processor. The critical path $T_\infty$ is the time it would take to execute the
program on infinitely many processors. The average parallelism is defined to
be $T_1/T_\infty$, and represents a measure of the speedup that the program can obtain.
When the average parallelism of the program is much greater than the number of
processors $P$ being used, a theorem shows that Cilk’s scheduler runs the program in
time approximately $T_1/P$ with high probability [4]. The average parallelism of the
formfactor calculations is measured as 221. Unfortunately, this measurement does
not account for time spent in contention for user locks; such contention both adds
work for the program and reduces parallelism. On 8 processors, however, the work is
only increased by 18 percent, and the average parallelism is around 195. This high
average parallelism implies that the calculations could be further sped up with more
than 8 processors.
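As a rough illustration of how these figures combine, the arithmetic can be written out as below. This is a back-of-the-envelope estimate, not a measurement: the primed symbols $T_1'$ and $T_\infty'$ (work and critical path including lock contention) are notation introduced here, and the estimate simply applies the approximation $T_P \approx T_1/P$ to the formfactor calculations alone.

```latex
\[
\frac{T_1}{T_\infty} = 221 \gg P = 8
\quad\Longrightarrow\quad
T_P \approx \frac{T_1}{P}.
\]
% With the 18 percent increase in work observed on 8 processors:
\[
T_1' = 1.18\,T_1, \qquad
\frac{T_1'}{T_\infty'} \approx 195
\quad\Longrightarrow\quad
T_\infty' \approx \frac{1.18\,T_1}{195} \approx 0.0061\,T_1 .
\]
% The average parallelism 195 is still much greater than 8, so:
\[
T_8 \approx \frac{T_1'}{8} = \frac{1.18\,T_1}{8} \approx 0.148\,T_1,
\qquad
\frac{T_1}{T_8} \approx \frac{8}{1.18} \approx 6.8 .
\]
```

Under these assumptions the contention costs the formfactor calculations roughly one processor's worth of speedup on 8 processors, while the critical path $T_\infty'$ remains far smaller than $T_1'/8$, which is consistent with the remark that more processors would still help.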


[Plot omitted: speedup (vertical axis, 0 to 7) versus number of processors (horizontal axis, 1 to 8), with one curve each for the formfactor and patch refinement calculations, vertex color computation, T-vertex elimination, and the entire execution.]


Figure 10-5: Speedup of the components of the rad application on the maze scene as 
compared to the original optimized C code. The measurements were taken on an 8-processor 
167-MHz UltraSPARC I. 


Upon examination, we find that our parallelization changed less than 5 percent of
the total code. We were not required to examine or understand the majority of the
code. Furthermore, the Nondeterminator-2 gives us reason to believe that the code is
data-race free. The combination of Cilk and the Nondeterminator-2 made it practical
to efficiently and correctly parallelize this large-scale, real-world application.


Chapter 11


Conclusion


The many challenges to successfully parallelizing programs include expressing the 
parallelism in the program, getting good performance out of parallel hardware, and 
debugging. Cilk is designed to address the first two issues, and in this thesis, we 
have addressed the third. We presented the ALL-SETS and BRELLY algorithms for 
finding dag races, and explained how those races relate to the semantics of the pro- 
gram. We showed how these tools were used to parallelize a large, real-world radiosity 
application. 

Although the Nondeterminator-2 is an efficient tool for race detection, many issues
surrounding its use remain unresolved. A key decision by Cilk programmers is whether 
to adopt the umbrella locking discipline. Programmers might first debug with ALL- 
SETS, but unless they have adopted the umbrella discipline, they will be unable to 
fall back on BRELLY if ALL-SETS seems too slow. We recommend that programmers 
use the umbrella discipline initially, which is good programming practice in any event, 
and only use ALL-SETS if they are forced to drop the discipline. 

Even when using ALL-SETS, users can encounter false positives and false nega- 
tives from the Nondeterminator-2 when their programs are nonabelian. It is an open 
question whether there are other classes of programs (besides abelian programs) for 
which the Nondeterminator-2 can provide guarantees of determinacy. If we examine 
the proof in Chapter 8, we find that we don’t actually need the strong requirement
of commutativity that each of two critical sections must execute the “identical
sequence of instructions on the identical memory locations” in either order of execution.
Rather, it is only necessary that each critical section read and write the same set of 
memory locations in either execution order, and also that in either execution order 
each critical section acquire the same locks in the same order. Thus, we may be able 
to consider as abelian some programs that are not formally abelian by the definition 
given in Chapter 8. 

This generalization of the definition of abelian has implications for nonabelian 
programs as well; it could provide an approach to avoid some of the false negative 
problems discussed in Chapter 7. It may be possible for a compiler to conservatively 
estimate the memory locations touched by critical sections. Thus, even if a critical 
section does not happen to touch all of those locations in a given computation, we may 
be able to find dag races in other computations using those conservative estimates. 

In Chapter 10, we argued that although the process of adding nodes to a linked 
list in parallel is nonabelian, in practice the Nondeterminator-2 does not miss races, 
because the order of the nodes in the linked list doesn’t matter. It may be possible 
to prove such a claim by proving that the code operates on the same set of memory 
locations regardless of the order of the nodes in the linked list. 

The techniques we have presented for reducing the number of false race reports in 
nonabelian programs are at best imperfect. It would be preferable to have a “higher 
level” language construct for annotating code than PUBLISH, which requires the user 
to be explicitly aware of the exact memory locations being published. Furthermore, 
in some cases PUBLISH does not properly convey the semantics of the user’s code. 
The user may in fact be using critical sections to synchronize the entire program, and 
not to publish any particular memory. Such semantics might be better handled by 
introducing other language constructs into Cilk that precisely express the synchro- 
nization semantics intended. A preferable solution is probably to once again allow the 
user to annotate the code, expressing the fact that certain critical regions actually 
synchronize the program. In order to properly handle such directives, we need to
extend the SP-BAGS algorithm to graphs that are not series-parallel.


Missed races and false reports are not a problem when the program being debugged
is abelian, but programmers would like to know whether an ostensibly abelian program
is actually abelian. Dinning and Schonberg give a conservative compile-time
algorithm to check if a program is “internally deterministic” [11], and we have given 
thought as to how the abelian property might likewise be conservatively checked. The 
parallelizing compiler techniques of Rinard and Diniz [38] may be applicable. 

The guarantee of the Nondeterminator-2 for abelian programs requires that the 
program be deadlock-free, which is left to the user to verify. We would prefer to have 
a way of checking if a program, or at least a computation of a program, is deadlock 
free. While this problem in general appears difficult, there may be a reasonable, 
flexible locking discipline that precludes deadlocks and that allows efficient detection. 

Although we believe that the Nondeterminator-2 is a useful tool, we have the unfair 
advantage of having developed it. Other programmers may not want to take the time 
to learn how to use the tool. Past experience has shown that many programmers 
assume that their program is correct if they run it several times without failures. 
Will such programmers be willing to try out a debugging tool that may only produce 
false race reports anyway? The answer remains to be seen, but from our experience 
we know that correct parallelization is hard, and we believe that any user would be
well advised to take the time to learn how to debug with the Nondeterminator-2.


Appendix A


Deadlock in the Computation 


In this appendix, we give a proof of Lemma 17.¹ This lemma shows that for abelian
programs, a deadlock in a dag-race free computation corresponds exactly to a deadlock 
in the program. 

Ideally, we would have an algorithm that checks for deadlocks in a computation 
dag. Users would run this algorithm along with ALL-SETS or BRELLY to directly 
use the results of Theorems 15 and 16. Since we do not (yet) have an efficient 
algorithm to detect deadlocks in a dag, however, using Theorems 15 and 16 directly 
requires users to manually examine computation dags for deadlocks. Users, however, 
presumably don’t really care about deadlocks in computations; they care whether 
their programs can actually deadlock. Fortunately, Lemma 17 shows that checking an 
abelian program for deadlocks is equivalent to checking any dag-race free computation 
of that program for deadlocks. 

In our current formulation, proving that a deadlock scheduling of a computation 
is feasible is not sufficient to show that the machine actually deadlocks. A deadlock 
scheduling is one that cannot be extended in the computation, but it may be pos- 
sible for the machine to extend the execution if the next machine instruction does 
not correspond to one of the possibilities in the dag. In this appendix, in order
to prove machine deadlocks, we think of a LOCK instruction as being composed of
two instructions: LOCK_ATTEMPT and LOCK_SUCCEED. Every two LOCK_SUCCEED
instantiations that acquire the same lock must be separated by an UNLOCK of that
lock, but multiple LOCK_ATTEMPT instantiations for the same lock can be executed by
different interpreters in arbitrary order. In other words, LOCK_ATTEMPT instructions
can always be executed by the interpreter, but LOCK_SUCCEED instructions cannot
be executed unless no other interpreter holds the lock. If an interpreter executes a
LOCK_ATTEMPT instruction, the next instruction executed by the interpreter must
be a LOCK_SUCCEED instruction for the same lock. A feasible deadlock scheduling
is therefore an actual machine deadlock, because the LOCK_SUCCEED instantiations
that come next in the dag are always the same as the next possible instantiations for
the machine.

¹This proof is joint work with Keith Randall.


A LOCK_ATTEMPT instantiation commutes with any other parallel instantiation.
For convenience, we still use the single instantiation LOCK to mean the sequence
LOCK_ATTEMPT LOCK_SUCCEED.
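The rules for the two-instruction view of LOCK can be sketched as a small state machine in C. This is an illustrative sketch only: the names `Machine`, `can_execute`, `execute`, and `all_blocked`, the fixed sizes, and the choice to reject a second LOCK_ATTEMPT from an interpreter that already has one outstanding are assumptions made here to encode the rule that the next instruction after a LOCK_ATTEMPT must be the matching LOCK_SUCCEED.

```c
#include <assert.h>

#define NUM_LOCKS 2
#define NUM_INTERPRETERS 2
#define NONE (-1)

typedef enum { LOCK_ATTEMPT, LOCK_SUCCEED, UNLOCK } Op;

typedef struct {
    int holder[NUM_LOCKS];          /* interpreter holding each lock, or NONE */
    int pending[NUM_INTERPRETERS];  /* lock each interpreter has attempted, or NONE */
} Machine;

static void machine_init(Machine *m)
{
    for (int l = 0; l < NUM_LOCKS; l++) m->holder[l] = NONE;
    for (int p = 0; p < NUM_INTERPRETERS; p++) m->pending[p] = NONE;
}

/* Returns 1 if interpreter `who` may execute `op` on `lock` right now. */
static int can_execute(const Machine *m, int who, Op op, int lock)
{
    switch (op) {
    case LOCK_ATTEMPT:   /* never blocked by what other interpreters hold */
        return m->pending[who] == NONE;
    case LOCK_SUCCEED:   /* must follow this interpreter's own attempt,
                            and the lock must currently be free */
        return m->pending[who] == lock && m->holder[lock] == NONE;
    case UNLOCK:         /* only the holder may release the lock */
        return m->pending[who] == NONE && m->holder[lock] == who;
    }
    return 0;
}

/* Executes the instruction if allowed; returns 1 if it ran, 0 if it must wait. */
static int execute(Machine *m, int who, Op op, int lock)
{
    if (!can_execute(m, who, op, lock))
        return 0;
    switch (op) {
    case LOCK_ATTEMPT: m->pending[who] = lock; break;
    case LOCK_SUCCEED: m->pending[who] = NONE; m->holder[lock] = who; break;
    case UNLOCK:       m->holder[lock] = NONE; break;
    }
    return 1;
}

/* Deadlock in this model: every interpreter has an outstanding
   LOCK_ATTEMPT whose LOCK_SUCCEED cannot execute. */
static int all_blocked(const Machine *m)
{
    for (int p = 0; p < NUM_INTERPRETERS; p++)
        if (m->pending[p] == NONE ||
            can_execute(m, p, LOCK_SUCCEED, m->pending[p]))
            return 0;
    return 1;
}
```

For example, if interpreter 0 acquires lock 0, interpreter 1 acquires lock 1, and each then executes a LOCK_ATTEMPT on the other's lock, both attempts are accepted but neither LOCK_SUCCEED can run, so `all_blocked` reports a machine deadlock.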


It is the proof of Lemma 17 that requires the extra technical condition on
commutativity that is mentioned in Chapter 8, which we call prefix commutativity:
essentially, a prefix of a region locked by the same lock as a complete region must
“commute” with the complete region. Precisely, given a partial scheduling $X$, two
parallel regions $R_1$ and $R_2$ that are surrounded by the same lock, and $R_2'$ a prefix
of $R_2$, then $X R_1 R_2'$ being feasible implies that $X R_2'$ is feasible. The reason for this
requirement is that it may be the case that it is never possible for two complete
regions to execute adjacent to each other. An example is shown in Figure A-1. In that
program, it is never possible for the two regions that lock the lock B (lines 11-17 and
20-26) to execute adjacent to each other, because those regions each acquire locks
that are held by the other thread. Therefore, without the requirement of prefix
commutativity those regions would not be required to commute in any way. It is possible,
however, to execute one entire region, say lines 11-17, and then a prefix of the other,
namely from line 20 up to the LOCK_ATTEMPT(A) in line 23. Prefix commutativity
requires that this prefix consist of the same instantiations as if it were executed before
the complete region in lines 11-17. The code for the program in Figure A-1 does not
satisfy this requirement, and so the program is not abelian. In particular, we observe
that the program uses special logic to avoid the possibility of deadlock. The prefix
commutativity requirement allows us to prove that when parallel regions cannot
actually occur adjacent in an execution, then the program must contain a deadlock.

        int x;
        Cilk_lockvar A;
        Cilk_lockvar B;
        Cilk_lockvar C;

        cilk double main()
        {
     1:   x = 0;
     2:   Cilk_lock_init(A);
     3:   Cilk_lock_init(B);
     4:   Cilk_lock_init(C);
     5:   spawn foo1();
     6:   spawn foo2();
     7:   sync;
     8:   printf("%d", x);
     9:   return 0;
        }

        cilk void foo1()
        {
    10:   Cilk_lock(A);
    11:   Cilk_lock(B);
    12:   x++;
    13:   if (x==2)
          {
    14:     Cilk_lock(C);
    15:     x = 5;
    16:     Cilk_unlock(C);
          }
    17:   Cilk_unlock(B);
    18:   Cilk_unlock(A);
        }

        cilk void foo2()
        {
    19:   Cilk_lock(C);
    20:   Cilk_lock(B);
    21:   x++;
    22:   if (x==2)
          {
    23:     Cilk_lock(A);
    24:     x = 6;
    25:     Cilk_unlock(A);
          }
    26:   Cilk_unlock(B);
    27:   Cilk_unlock(C);
        }

Figure A-1: A program that illustrates the need for the prefix commutativity requirement.
The program does not deadlock; of the two LOCK(B) ... UNLOCK(B) regions (lines 11-17 and
20-26), only the second one to execute acquires another lock (either A or C). Furthermore,
those regions can never execute entirely adjacent to each other, for the second one to execute
must wait for the entire other thread to complete. This program does not have any data (or
dag) races, but it may produce a final value of x as either 5 or 6. The prefix commutativity
requirement means that this program is not considered to be abelian, because the prefix in
lines 20-23 does not “commute” with the complete region in lines 11-17.
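The claims made about the program in Figure A-1 (no deadlock, and a final value of x that is either 5 or 6) can be checked by brute force over all interleavings of foo1 and foo2. The following C sketch does so under simplifying assumptions: each numbered line of foo1 and foo2 is treated as one atomic step, LOCK is treated as a single instruction that can run only when the lock is free (rather than as LOCK_ATTEMPT followed by LOCK_SUCCEED), and the names `Instr`, `State`, `Result`, `explore`, and `run_figure_a1` are hypothetical.

```c
#include <assert.h>

enum { A, B, C, NUM_LOCKS };
#define FREE (-1)
#define LEN 9                 /* instructions per thread */

typedef enum { LOCK, UNLOCK, INCR, SKIP_UNLESS_2, SET } Kind;
typedef struct { Kind kind; int arg; } Instr;

/* Row 0 models foo1 (lines 10-18); row 1 models foo2 (lines 19-27).
   SKIP_UNLESS_2 models "if (x==2)": it skips the next `arg`
   instructions when x != 2. */
static const Instr prog[2][LEN] = {
    { {LOCK, A}, {LOCK, B}, {INCR, 0}, {SKIP_UNLESS_2, 3},
      {LOCK, C}, {SET, 5}, {UNLOCK, C}, {UNLOCK, B}, {UNLOCK, A} },
    { {LOCK, C}, {LOCK, B}, {INCR, 0}, {SKIP_UNLESS_2, 3},
      {LOCK, A}, {SET, 6}, {UNLOCK, A}, {UNLOCK, B}, {UNLOCK, C} },
};

typedef struct { int x; int pc[2]; int owner[NUM_LOCKS]; } State;
typedef struct { int saw_final[8]; int deadlocks; } Result;

/* Depth-first search over every interleaving reachable from state s. */
static void explore(State s, Result *r)
{
    if (s.pc[0] == LEN && s.pc[1] == LEN) {
        r->saw_final[s.x] = 1;           /* both threads finished */
        return;
    }
    int progressed = 0;
    for (int t = 0; t < 2; t++) {
        if (s.pc[t] == LEN)
            continue;                     /* thread t already finished */
        Instr in = prog[t][s.pc[t]];
        if (in.kind == LOCK && s.owner[in.arg] != FREE)
            continue;                     /* thread t is blocked */
        State n = s;
        n.pc[t]++;
        switch (in.kind) {
        case LOCK:          n.owner[in.arg] = t;               break;
        case UNLOCK:        n.owner[in.arg] = FREE;            break;
        case INCR:          n.x++;                             break;
        case SKIP_UNLESS_2: if (n.x != 2) n.pc[t] += in.arg;   break;
        case SET:           n.x = in.arg;                      break;
        }
        progressed = 1;
        explore(n, r);
    }
    if (!progressed)
        r->deadlocks++;                   /* unfinished, yet nothing can run */
}

/* Explores all schedules starting from x = 0 with all locks free. */
static Result run_figure_a1(void)
{
    State s = { 0, {0, 0}, {FREE, FREE, FREE} };
    Result r = { {0}, 0 };
    explore(s, &r);
    return r;
}
```

Calling `run_figure_a1()` records which final values of x are reachable in `saw_final` and counts stuck states in `deadlocks`. Because the search enumerates schedules of this simplified model only, it illustrates the behavior described in the caption rather than proving anything about the Cilk runtime.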

To prove Lemma 17, we first introduce new versions of Lemmas 11, 12, and 13
that assume a deadlock-free program instead of a deadlock-free dag. We then use
these modified versions to prove Lemma 17.


Lemma 20 (Reordering) Let $G$ be a dag-race free computation resulting from the
execution of a deadlock-free abelian program, and let $R_1$ and $R_2$ be two parallel regions
in $G$. Then:

1. Let $X$ be a partial scheduling of $G$ of the form $X_1 R_1 R_2 X_2$. The partial scheduling
$X$ and the partial scheduling $X_1 R_2 R_1 X_2$ are equivalent.

2. Let $Y$ be a feasible partial scheduling of $G$ of the form $Y = Y_1 R_1 R_2'$, where $R_2'$
is a prefix of $R_2$. Then the partial scheduling $Y_1 R_2'$ is feasible.


Proof: We prove the lemma by double induction on the nesting count of the regions.
Our inductive hypothesis is the lemma as stated for regions $R_1$ of nesting count $i$
and regions $R_2$ of nesting count $j$. The proofs for part 1 and part 2 are similar, so
sometimes we will prove part 1 and provide the modifications needed for part 2 in
parentheses.

Base case: $i = 0$. Then $R_1$ is a single instantiation. Since $R_1$ and $R_2$ ($R_2'$) are
parallel and are adjacent in $X$ ($Y$), no instantiation of $R_2$ ($R_2'$) can be guarded by a
lock that guards $R_1$, because any lock held at $R_1$ is not released until after $R_2$ ($R_2'$).
Therefore, since $G$ is dag-race free, either $R_1$ and $R_2$ ($R_2'$) access different memory
locations or $R_1$ is a READ and $R_2$ ($R_2'$) does not write to the location read by $R_1$. In
either case, the instantiations of each of $R_1$ and $R_2$ ($R_2'$) do not affect the behavior of
the other, so they can be executed in either order without affecting the final memory
state.

Base case: $j = 0$. Symmetric with above.

Inductive step: In general, $R_1$ has nesting count $i \geq 1$, and is of the form
LOCK(A) ... UNLOCK(A). $R_2$ of count $j \geq 1$ has the form LOCK(B) ... UNLOCK(B).
If $A = B$, then $R_1$ and $R_2$ commute by the definition of abelian. Part 1 then follows
from the definition of commutativity, and part 2 follows from prefix commutativity.
Otherwise, there are three possible cases.

Case 1: Lock $A$ does not appear in $R_2$ ($R_2'$). For part 1, we start with the sequence
$X_1 R_1 R_2 X_2$ and commute pieces of $R_1$ one at a time with $R_2$: first, the instantiation
UNLOCK(A), then the immediate subregions of $R_1$, and finally the instantiation
LOCK(A). The instantiations LOCK(A) and UNLOCK(A) commute with $R_2$ because
$A$ does not appear anywhere in $R_2$. Each subregion of $R_1$ commutes with $R_2$ by the
inductive hypothesis, because each subregion has lower nesting count than $R_1$. After
commuting all of $R_1$ past $R_2$, we have an equivalent execution $X_1 R_2 R_1 X_2$. For part 2,
the same procedure can be used to drop pieces of $R_1$ in the feasible partial schedule
$Y_1 R_1 R_2'$ until the feasible partial schedule $Y_1 R_2'$ is reached.


Case 2: Lock $B$ does not appear in $R_1$. The argument for part 1 is symmetric with
Case 1. For part 2, we break $R_2'$ into its constituents: $R_2' = \mathrm{LOCK}(B)\, R_{2,1} \cdots R_{2,n} R_2''$,
where $R_{2,1}$ through $R_{2,n}$ are complete regions, and $R_2''$ is a prefix of a region. The
instantiation LOCK(B) commutes with $R_1$ because $B$ does not appear in $R_1$, and
the complete regions $R_{2,1}$ through $R_{2,n}$ commute with $R_1$ by induction. From the
schedule $Y_1\, \mathrm{LOCK}(B)\, R_{2,1} \cdots R_{2,n} R_1 R_2''$, we again apply the inductive hypothesis to
drop $R_1$, which proves that $Y_1\, \mathrm{LOCK}(B)\, R_{2,1} \cdots R_{2,n} R_2'' = Y_1 R_2'$ is feasible.

Case 3: Lock $A$ appears in $R_2$ ($R_2'$), and lock $B$ appears in $R_1$. For part 1, if both
schedulings $X_1 R_1 R_2 X_2$ and $X_1 R_2 R_1 X_2$ are infeasible, then we are done. Otherwise,
we prove a contradiction by showing that the program can deadlock. Without loss of
generality, let the scheduling $X_1 R_1 R_2 X_2$ be a feasible scheduling. Because $X_1 R_1 R_2 X_2$
is a feasible scheduling, the partial scheduling $X_1 R_1 R_2$ is feasible as well.

We now continue the proof for both parts of the lemma. Let $\alpha_1$ be the prefix of $R_1$
up to (and including) the first LOCK_ATTEMPT(B) instantiation, let $\beta_1$ be the rest of
$R_1$, and let $\alpha_2$ be the prefix of $R_2$ ($R_2'$) up to (and including) the first LOCK_ATTEMPT
of a lock acquired in $R_2$ ($R_2'$) that is acquired but not released in $\alpha_1$. At least one
such lock exists, namely $A$, so $\alpha_2$ is not all of $R_2$ ($R_2'$).

We show that the partial scheduling $X_1 \alpha_1 \alpha_2$ is also feasible. This partial scheduling,
however, cannot be completed to a full scheduling of the program because $\alpha_1$
and $\alpha_2$ each hold the lock that the other is attempting to acquire.

We prove the partial scheduling $X_1 \alpha_1 \alpha_2$ is feasible by starting with the feasible partial
scheduling $X_1 R_1 \alpha_2 = X_1 \alpha_1 \beta_1 \alpha_2$ and dropping complete subregions and unpaired
unlocks in $\beta_1$ from in front of $\alpha_2$. The sequence $\beta_1$ has two types of instantiations,
those in regions completely contained in $\beta_1$, and unpaired unlocks.

Unpaired unlocks in $\beta_1$ must have their matching lock in $\alpha_1$, so that lock does
not appear in $\alpha_2$ by construction. Therefore, an unlock instantiation just before $\alpha_2$
commutes with $\alpha_2$ and thus can be dropped from the schedule. Any complete region
just before $\alpha_2$ can be dropped by the inductive hypothesis. When we have dropped
all instantiations in $\beta_1$, we obtain the feasible partial scheduling $X_1 \alpha_1 \alpha_2$ which cannot
be completed, and hence the program has a deadlock. ∎

Lemma 21 (Region grouping) Let $G$ be a dag-race free computation generated by
the execution of a deadlock-free abelian program. Let $X_1 X X_2$ be a scheduling of $G$,
for some instantiation sequences $X_1$, $X$, and $X_2$. Then, there exists an instantiation
sequence $X'$ such that $X_1 X' X_2$ is equivalent to $X_1 X X_2$ and every region entirely
contained in $X'$ is contiguous.


Proof: As a first step, we create $X''$ by commuting each LOCK_ATTEMPT in $X$ to
immediately before the corresponding LOCK_SUCCEED. In this way, every complete
region begins with a LOCK instantiation. If there is no corresponding LOCK_SUCCEED
in $X$, we commute the LOCK_ATTEMPT instantiation to the end of $X''$.

Next, we create our desired $X'$ by grouping all the complete regions in $X''$ one
at a time. This can be done using identical techniques to the proof of Lemma 12,
applying Lemma 20 in place of Lemma 11. ∎


Lemma 22 Let $G$ be a dag-race free computation resulting from the execution of a
deadlock-free abelian program. Then every legal scheduling of $G$ is feasible and yields
the same final memory state.

Proof: The proof is identical to the proof of Lemma 13, using the Reordering and
Region Grouping lemmas from this appendix in place of those from Chapter 8. ∎


We restate and then prove Lemma 17. 


Lemma 17 Let $G$ be a dag-race free computation generated by an abelian program.
$G$ is deadlock free if and only if the program is deadlock free (on the same input).

Proof: ($\Rightarrow$) If $G$ is deadlock free, then every machine execution of the program is a
scheduling of $G$ by Lemma 14, so the machine cannot have a deadlock execution.

($\Leftarrow$) By contradiction. Assume that a deadlock-free abelian program $P$ can generate
a dag-race free computation $G$ that has a deadlock. We show that $P$ can deadlock,
which is a contradiction.

The proof has two parts. In the first part, we generate a feasible scheduling $Y$ of $G$
that is “almost” a deadlock scheduling. Then, we show that $Y$ can be modified slightly
to generate a deadlock scheduling that is also feasible, which proves the contradiction.

Every deadlock scheduling contains a set of threads $e_1, e_2, \ldots, e_n$, some of which
are completed and some of which are not. Each thread $e_i$ has a depth, which
is the length of the longest path in $G$ from the initial node to the last instantiation
in $e_i$. We can define the depth of a deadlock scheduling as the $n$-tuple
$(\mathrm{depth}(e_1), \mathrm{depth}(e_2), \ldots, \mathrm{depth}(e_n))$, where we order the $e_i$ such that
$\mathrm{depth}(e_1) \geq \mathrm{depth}(e_2) \geq \cdots \geq \mathrm{depth}(e_n)$. Depths of deadlocked schedulings are compared in the
dictionary order.²

We generate the scheduling Y of G which is almost a deadlock scheduling by 
modifying a particular deadlock scheduling of G. We choose the deadlock scheduling 
X from which we will create the scheduling Y to have the maximum depth of any 
deadlock scheduling of G. 
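The comparison of depths used to pick a maximum-depth deadlock scheduling can be sketched in C as follows. This is a minimal sketch: the function names `normalize_depths` and `dict_less` are hypothetical, depths are represented as plain integers, and the comparison follows the dictionary order on tuples, with the empty tuple less than any nonempty tuple.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* qsort comparator that places larger depths first. */
static int by_depth_desc(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x < y) - (x > y);
}

/* Orders the thread depths so that d[0] >= d[1] >= ... >= d[n-1]. */
static void normalize_depths(int *d, size_t n)
{
    qsort(d, n, sizeof *d, by_depth_desc);
}

/* Returns 1 if tuple a is strictly less than tuple b in dictionary order. */
static int dict_less(const int *a, size_t na, const int *b, size_t nb)
{
    size_t i = 0;
    while (i < na && i < nb) {
        if (a[i] < b[i]) return 1;   /* first difference decides */
        if (a[i] > b[i]) return 0;
        i++;
    }
    /* All compared entries are equal: a is less only if it ran out first. */
    return i == na && i < nb;
}
```

A deadlock scheduling has maximum depth when no other deadlock scheduling's normalized depth tuple is greater under `dict_less`; for instance, the tuple (5, 3, 2) is less than (5, 4, 1) because the tuples first differ in the second entry.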

Let us examine the structure of $X$ in relation to $G$. The deadlock scheduling $X$
divides $G$ into a set of completely executed threads $X_1$, a set of unexecuted threads $X_2$,
and a set of partially executed threads $T = \{t_1, \ldots, t_n\}$, which are the threads whose
last executed instantiation in the deadlock scheduling is a LOCK_ATTEMPT. We divide
each of the threads in $T$ into two pieces. Let $A = \{\alpha_1, \ldots, \alpha_n\}$ be the parts of the $t_i$
up to and including the last executed instantiation, and let $B = \{\beta_1, \ldots, \beta_n\}$ be the
rest of the instantiations of the $t_i$. We say that $\alpha_i$ blocks $\beta_j$ if the first instantiation
in $\beta_j$ is a LOCK_SUCCEED on a lock that is acquired but not released by $\alpha_i$.

$X$ is a deadlock scheduling containing the instantiations in $X_1 \cup A$. To isolate the
effect of the incomplete regions in $A$, we construct the legal scheduling $X'$ which first
schedules all of the instantiations in $X_1$ in the same order as they appear in $X$, and
then all of the instantiations in $A$ in the same order as they appear in $X$.


The first instantiations of the $\beta_i$ cannot be scheduled in $X'$ because they are blocked
by some $\alpha_j$. We now prove that the blocking relation is a bijection. Certainly, a
particular $\beta_i$ can only be blocked by one $\alpha_j$. Suppose there exists an $\alpha_j$ blocking
two or more threads in $B$. Then by the pigeonhole principle some thread $\alpha_k$ blocks
no threads in $B$. This contradicts the fact that $X$ has maximum depth, because
the deadlock scheduling obtained by scheduling the sequence $X t_k$, all subsequently
runnable threads in $X_2$ in any order, and then the $n - 1$ partial threads in $A - \{\alpha_k\}$
is a deadlock scheduling with a greater depth than $X$.

²The dictionary order $<_D$ is a partial order on tuples that can be defined as follows: The size
0 tuple is less than any other tuple. $(i_1, i_2, \ldots, i_m) <_D (j_1, j_2, \ldots, j_n)$ if $i_1 < j_1$ or if $i_1 = j_1$ and
$(i_2, i_3, \ldots, i_m) <_D (j_2, j_3, \ldots, j_n)$.


Without loss of generality, let $\alpha_2$ be a thread in $A$ with a deepest last instantiation.
Since the blocking relation is a bijection, only one thread blocks $\beta_2$; without loss of
generality, let it be $\alpha_1$. Break $\alpha_1$ up into two parts, $\alpha_1 = \alpha_1^L \alpha_1^R$, where the first
instantiation of $\alpha_1^R$ attempts to acquire the lock that blocks $\beta_2$. ($\alpha_1^L$ may be empty.)
To construct a legal schedule, we start with $X'$ and remove the instantiations in $\alpha_1^R$
from $X'$. The result is still a legal scheduling because we did not remove any UNLOCK
without also removing its matching LOCK. We then schedule the first instantiation
of $\beta_2$, which we know is legal because we just unblocked it. We then complete the
scheduling of the threads in $T$ by scheduling the remaining instantiations in $T$ ($\alpha_1^R$ and
all instantiations in $B$ except for the first one in $\beta_2$). We know that such a scheduling
exists, because if it didn’t, then there would be a deeper deadlock schedule (because
we executed one additional instantiation from $\beta_2$, the deepest incomplete thread,
and we didn’t remove any completed threads). We finish off this legal scheduling by
completing $X_2$ in topological sort order.

As a result, the constructed schedule consists of four pieces, which we call $Y_1$, $Y_2$,
$Y_3'$, and $Y_4$. The sequence $Y_1$ is some scheduling of the instantiations in $X_1$, $Y_2$ is some
scheduling of the instantiations in $\alpha_1^L \cup \alpha_2 \cup \cdots \cup \alpha_n$, $Y_3'$ is some scheduling of the
instantiations in $\alpha_1^R \cup \beta_1 \cup \cdots \cup \beta_n$, and $Y_4$ is some scheduling of the instantiations
in $X_2$. To construct $Y$, we first group the complete regions in $Y_3'$ using Lemma 21
to get $Y_3$, and then define $Y$ to be the schedule $Y_1 Y_2 Y_3 Y_4$. Since $Y$ is a (complete)
scheduling of $G$, it is feasible by Lemma 22.

The feasible scheduling $Y$ is almost the same as the deadlock scheduling $X$, except
$\alpha_1^R$ is not in the right place. We further subdivide $\alpha_1^R$ into two pieces, $\alpha_1^R = \alpha_1' \alpha_1''$,
where $\alpha_1'$ is the maximum prefix of $\alpha_1^R$ that contains no LOCK_SUCCEED instantiations
of locks that are held but not released by the instantiations in $\alpha_1^L, \alpha_2, \ldots, \alpha_n$. (Such
an $\alpha_1'$ must exist in $\alpha_1^R$ by choice of $\alpha_1^R$, and furthermore $\alpha_1'$ is contiguous in $Y$ because
$\beta_1$ completes the region started at $\alpha_1'$, and both $\beta_1$ and $\alpha_1'$ are part of $Y_3$.) We now
drop all instantiations after $\alpha_1'$ to make a partial scheduling. We then commute
$\alpha_1'$ to the beginning of $Y_3$, dropping instantiations as we go, to form the feasible
scheduling $Y_1 Y_2 \alpha_1'$. Two types of instantiations are in front of $\alpha_1'$. Complete regions
before $\alpha_1'$ are contiguous and can be dropped using Lemma 20. Unlock instantiations
can be dropped from in front of $\alpha_1'$ because they are unlocks of some lock acquired in
$\alpha_1^L, \alpha_2, \ldots, \alpha_n$, which do not appear in $\alpha_1'$ by construction. By dropping instantiations,
we arrive at the feasible scheduling $Y_1 Y_2 \alpha_1'$, which is a deadlock scheduling, as every
thread is blocked. This completes the proof. ∎

